I'm debugging code, and loading the database tables I need takes up the majority of the time for each run. Is there a way in PyCharm to cache these results so that I can debug faster, rather than reloading the tables every time I restart the debug session?
You can store (a subset of) the data in a pickle or text file and read from that file instead of the database when debugging.
Depending on what data type you use to hold the database data (you don't provide any code or a minimal example), there are different ways to write a pickle file; here are two:
pandas.DataFrame.to_pickle
Pickle a dict
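For example, a minimal sketch of the idea (assuming pandas; the query, connection, and cache paths are placeholders for your setup). Delete the pickle file whenever you want to force a fresh load from the database.
```
import os
import pickle

import pandas as pd

CACHE_PATH = "table_cache.pkl"  # hypothetical cache file; pick any path you like

def load_table_cached(query, connection):
    """Load the table from the database once, then reuse the pickled copy."""
    if os.path.exists(CACHE_PATH):
        return pd.read_pickle(CACHE_PATH)   # fast path on later debug runs
    df = pd.read_sql(query, connection)     # slow path: hit the database once
    df.to_pickle(CACHE_PATH)
    return df

# The same idea works for a plain dict of rows:
def cache_dict(data, path="dict_cache.pkl"):
    with open(path, "wb") as f:
        pickle.dump(data, f)

def load_cached_dict(path="dict_cache.pkl"):
    with open(path, "rb") as f:
        return pickle.load(f)
```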
A more complex and general method would be to use a real cache like memcached with your Python script. This article describes how.
The article is too comprehensive to reproduce here, so I would recommend reading it there.
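As a rough sketch of that approach (assuming a memcached instance on localhost:11211 and the pymemcache package; the key name and expiry are arbitrary):
```
import pickle

from pymemcache.client.base import Client  # assumes the pymemcache package

client = Client(("localhost", 11211))      # assumes memcached is running locally

def get_cached(key, load_from_db, expire=3600):
    """Return cached rows if memcached has them, otherwise load and cache them."""
    cached = client.get(key)
    if cached is not None:
        return pickle.loads(cached)         # served from the cache
    rows = load_from_db()                   # the expensive database call
    client.set(key, pickle.dumps(rows), expire=expire)
    return rows
```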
Related
I'm looking for fast and efficient data storage to build my PHP-based web site. I'm aware of MySQL. Can I use a JSON file in my server root directory instead of a MySQL database? If yes, what is the best way to do it?
You can use any single file, including a JSON file, like this:
1. Lock it somehow (google PHP file locking; it's possibly as simple as adding a parameter to the file-open function or changing the function name to the locking version).
2. Read the data from the file and parse it into an internal data structure.
3. Optionally modify the data in the internal data structure.
4. If you modified the data, truncate the file to 0 length and write the new data to it.
5. Unlock the file as soon as you can; other requests may be waiting... (a sketch of this flow follows below).
You can keep using the data in the internal structures to render the page, just remember it may be outdated as soon as you release the file lock, since another HTTP request can then modify it.
Also, if you modify the data from a user's web form, remember that it may have been modified in between. For example: a page with user details is loaded for editing, another user then deletes that user, and when the editor tries to save the changed details they should probably get an error instead of re-creating the deleted user.
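The question is about PHP, where the lock in step 1 would typically be flock(); here is the same lock-read-modify-write flow sketched in Python just to make the steps concrete (POSIX-only because of fcntl, and the file is assumed to already contain valid JSON):
```
import fcntl   # POSIX-only; on Windows you would need a different locking mechanism
import json

def update_store(path, mutate):
    """Lock the JSON file, read it, apply a change, truncate and rewrite, unlock."""
    with open(path, "r+") as f:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)      # 1. exclusive lock
        try:
            data = json.load(f)                     # 2. read and parse
            mutate(data)                            # 3. modify in memory
            f.seek(0)
            f.truncate()                            # 4. truncate to 0 and rewrite
            json.dump(data, f)
            f.flush()                               # make sure the data hits the file
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)  # 5. unlock as soon as possible
    return data

# e.g. update_store("store.json", lambda d: d.update(visits=d.get("visits", 0) + 1))
```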
Note: This is very inefficient. If you are building a site where you expect more than say 10 simultaneous users, you have to use a more sophisticated scheme, or just use existing database... Also, you can't have too much data, because parsing JSON and generating modified JSON takes time.
As long as you have just one user at a time, it will simply get slower and slower as the amount of data grows. But as the user count increases, more users mean both more requests and more data, so things start to get exponentially slower, and you very soon hit the limit where HTTP requests start to expire before the file is available to handle the request...
At that point, do not try to hack it to make it faster; instead pick some existing database framework (SQL or NoSQL or file-based). If you start hacking together your own, you just end up re-inventing the wheel, usually poorly :-). Well, unless it is just a programming exercise, but even then it might be better to learn to use some existing framework instead.
I wrote an Object Document Mapper for JSON files called JSON ODM. It may be a bit late, but if it is still needed, it is open source under the MIT Licence.
It provides a query language and some GeoJSON tools.
The new version of IBM Informix, 12.10 xC2, now supports JSON.
Check the link: http://pic.dhe.ibm.com/infocenter/informix/v121/topic/com.ibm.json.doc/ids_json_007.htm
The manual says it is compatible with MongoDB drivers.
About the Informix JSON compatibility:
Applications that use the JSON-oriented query language, created by MongoDB, can interact with data stored in Informix® databases. The Informix database server also provides built-in JSON and BSON (binary JSON) data types.
You can use MongoDB community drivers to insert, update, and query JSON documents in Informix.
I'm not sure, but I believe you can use the Innovator-C edition (free for production) to test it and use it at no cost, even in a production environment.
One obvious case where you can prefer JSON (or another file format) over a database is when all your (relatively small) data is stored in the application cache.
When an application server (re)starts, an application reads data from file(s) and stores it in the data structure.
When data changes, an application updates file(s).
Advantage: no database.
Disadvantage: for a number of reasons, this can be used only for systems with relatively small amounts of data. For example, a very specific product site with several hundred products.
I would like to know if a pure Node.js web app can be developed, which would mean very simple deployment. From my understanding, since Node.js is good at I/O, a database written in Node.js should be good too. Does one exist? Especially one that lives in RAM and occasionally persists to disk.
First of all, I don't see a problem in installing Redis or MongoDB. It can be done without any effort at all.
That said there are a number of such databases like:
ministore: save at specified intervals.
alfred: Reads are fast because indexes into files are kept in memory.
nStore: an index of all documents and their exact location on disk is also kept in memory for fast reads of any document.
jsonds: a 'data store' which is just a JSON object that is written to disk at a set frequency.
supermarket
chaos
node-dirty
node-tiny
nedb: Embedded pure JS database with MongoDB-compatible API.
Also, most of these products are very young and should probably not be used in production yet.
You could also code something yourself, I assume, using node-sqlite3 to store data back to disk.
If you want a database in Node that exists only in RAM, you could simply use JavaScript objects and arrays to contain your data. If you need something more powerful, with queries that resemble SQL, then plain JavaScript objects might not be the best idea. With this approach you could also make the data persistent by flushing it to disk using JSON.stringify at a set interval.
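In Node that would be plain objects plus JSON.stringify inside a setInterval callback writing through the fs module; here is the same periodic-flush pattern sketched in Python (file name and interval are arbitrary):
```
import json
import threading

store = {"users": [], "posts": []}       # plain in-memory data structures

def flush_to_disk(path="store.json", interval=30.0):
    """Write the in-memory store to disk, then schedule the next flush."""
    with open(path, "w") as f:
        json.dump(store, f)
    threading.Timer(interval, flush_to_disk, args=(path, interval)).start()

flush_to_disk()   # start the periodic flush; reads and writes just touch `store`
```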
Try looking here: https://github.com/joyent/node/wiki/modules#database
Sorry for the short answer guys.
I am collecting lots of data from lots of machines. These machines cannot run PostgreSQL and they cannot connect to a PostgreSQL database. At the moment I save the data from these machines in CSV files and use the COPY FROM command to import the data into the PostgreSQL database. Even on high-end hardware this process takes hours. Therefore, I was thinking about writing the data in the format of the PostgreSQL database directly. I would then simply copy these files into the /data directory and start the PostgreSQL server. The server would then find the database files and accept them as databases.
Is such a solution feasible?
Theoretically this might be possible if you studied the source code of PostgreSQL very closely.
But you essentially wind up (re)writing the core of PostgreSQL, which qualifies as "not feasible" from my point of view.
Edit:
You might want to have a look at pg_bulkload which claims to be faster than COPY (haven't used it though)
Why can't they connect to the database server? If it is because of library dependencies, I suggest that you set up some sort of client-server solution (web services perhaps) that could queue and submit data along the way.
Relying on batch operations will always give you a headache when dealing with large amounts of data, and if COPY FROM isn't fast enough for you, I don't think anything will be.
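As a sketch of the web-service idea above (assuming Flask and psycopg2 on the receiving side, with a hypothetical measurements table and connection string; the collecting machines would POST their rows here instead of writing CSV files):
```
from flask import Flask, request   # assumes Flask is available
import psycopg2                    # and psycopg2 on the server side

app = Flask(__name__)
conn = psycopg2.connect("dbname=metrics user=collector")   # hypothetical DSN

@app.route("/rows", methods=["POST"])
def receive_rows():
    rows = request.get_json()      # e.g. [[machine_id, ts, value], ...]
    with conn, conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO measurements (machine_id, ts, value) VALUES (%s, %s, %s)",
            rows,
        )
    return "ok", 200
```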
Yeah, you can't just write the files out in any reasonable way. In addition to the data page format, you'd need to replicate the commit logs, part of the write-ahead logs, some transaction visibility parts, any conversion code for types you use, and possibly the TOAST and varlena code. Oh, and the system catalog data, as already mentioned. Rough guess, you might get by with only needing to borrow 200K lines of code from the server. PostgreSQL is built from the ground up around being extensible; you can't even interpret what an integer means without looking up the type information around the integer type in the system catalog first.
There are some tips for speeding up the COPY process at Bulk Loading and Restores. Turning off synchronous_commit in particular may help. Another trick that may be useful: if you start a transaction, TRUNCATE a table, and then COPY into it, that COPY goes much faster. It doesn't bother with the usual write-ahead log protection. However, it's easy to discover COPY is actually bottlenecked on CPU performance, and there's nothing useful you can do about that. Some people split the incoming file into pieces and run multiple COPY operations at once to work around this.
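A minimal sketch of the transaction + TRUNCATE + COPY trick, assuming psycopg2 and a hypothetical readings table fed from a CSV file:
```
import psycopg2   # table, file, and connection string are hypothetical

conn = psycopg2.connect("dbname=telemetry")
with conn, conn.cursor() as cur:               # one transaction around both steps
    cur.execute("TRUNCATE readings")           # lets the following COPY skip the usual WAL work
    with open("readings.csv") as f:
        cur.copy_expert("COPY readings FROM STDIN WITH (FORMAT csv)", f)
# leaving the block commits; an exception rolls the whole thing back instead
```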
Realistically, pg_bulkload is probably your best bet, unless it too gets CPU bound--at which point a splitter outside the database and multiple parallel loading is really what you need.
What are the problems associated with storing your data in files rather than databases? I'm thinking in terms of something like a blog engine. I read that MoveableType used to do this. What are the pros/cons of working this way?
Databases provide means to perform interesting queries more easily.
Examples: You would want to list the 10 most recent posts on the front page. Make an archive page that lists all articles published in a given year (taken from the url).
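For example, both of those become one-liners with a database; a sketch using sqlite3 against a hypothetical posts(title, published) table:
```
import sqlite3   # hypothetical posts(title, published) table in blog.db

conn = sqlite3.connect("blog.db")

# front page: the ten most recent posts
recent = conn.execute(
    "SELECT title, published FROM posts ORDER BY published DESC LIMIT 10"
).fetchall()

# archive page: every post from a year taken from the URL
year = "2012"
archive = conn.execute(
    "SELECT title, published FROM posts WHERE strftime('%Y', published) = ? "
    "ORDER BY published",
    (year,),
).fetchall()
```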
I think the main one is data consistency. If you keep everything together in one db table, you don't have to worry (as much) about the file being externally modified or deleted without the meta data being modified in sync. There's also the possibility of an incomplete write if the server fails while you're updating. In this case you have to take your own steps to implement transactions.
I think that with an appropriate level of care and file permissions though, these problems can be overcome.
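One such step, sketched in Python under the assumption that each post lives in its own JSON file: write to a temporary file and atomically rename it, so a crash mid-write never leaves a half-written post behind.
```
import json
import os
import tempfile

def atomic_write_json(path, data):
    """Write the data next to the target file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)   # atomic: readers see the old or the new file, never half of either
    except BaseException:
        os.unlink(tmp)          # clean up the temp file if anything went wrong
        raise
```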
It is much easier and more comfortable to specify access rights (to data or files) in a database than to use OS-specific access rights.
You can easily share data across machines and/or websites using database-stored files.
Unfortunately, it is (often) much slower to serve files stored in database.
I'm trying to populate a table with user information in a MS SQL database with information from multiple data sources (i.e. LDAP and some other MS SQL databases). The process needs to run as a daily scheduled task to ensure that the user information table is updated frequently.
The initial attempt at this query/update script was written in VBScript; it would query each data source and then update the user information table. Unfortunately, this takes a very long time to run and update the user information table.
I'm curious whether anyone has written anything similar, and whether you would recommend, or have noticed, a performance improvement from writing the script in another language. Some have recommended Perl because of multi-threading, but if anyone has other suggestions on ways to improve the process, or other approaches, please share tips or lessons learned.
It's good practice to use Data Transformation Services (DTS), or SSIS as it has become known, for repetitive DB tasks. Although this won't solve your problem, it may give some pointers to what is going on, as you can log each stage of the process, wrap it in transactions, etc. It is especially well suited for bulk loading and updates, and it understands VBScript natively, so there should be no problem there.
Other than that I have to agree with Brian: find out what's making it slow and fix that; changing languages is unlikely to fix it on its own, especially if you have an underlying issue. As a general point, my experience with LDAP, which is pretty limited, was that it could be incredibly slow at reading bulk user details.
I can't tell you how to solve your particular problem, but whenever you run into this situation you want to find out why it is slow before you try to solve it. Where is the slow down? Some major things to consider and investigate include:
getting the data
interacting with the network
querying the database
updating indices in the database
Get some timing and profiling information to figure out where to concentrate your efforts.
Hmmm. Seems like you could cron a script that uses dump utils from the various sources, then seds the output into good form for the load util for the target database. The script could be in bash or Perl, whatever.
Edit: In terms of performance, I think the first thing you want to try is to make sure that you disable any autocommit at the beginning of the load process, then issue the commit after writing all the records. This can make a HUGE performance difference.
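A sketch of that advice, assuming pyodbc and a hypothetical user_info table (the driver name, server, and columns are placeholders):
```
import pyodbc   # driver name, server, and table are placeholders for your environment

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dbhost;DATABASE=hr;Trusted_Connection=yes",
    autocommit=False,                 # one transaction for the whole load
)
cur = conn.cursor()
cur.fast_executemany = True           # send parameter batches instead of one row at a time

rows = [("jdoe", "jdoe@example.com"), ("asmith", "asmith@example.com")]
cur.executemany("INSERT INTO user_info (login, email) VALUES (?, ?)", rows)
conn.commit()                         # a single commit after all records are written
```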
As MrTelly said, use SSIS or DTS, then schedule the package to run. Just converting to this alone will probably fix your speed issue, as they have tasks that are optimized for bulk inserting. I would never do this in a scripting language rather than T-SQL anyway. Likely your script works row by row instead of on sets of data, but that is just a guess.
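To illustrate the row-by-row versus set-based point: a single set-based statement (hypothetical table and column names, run here through pyodbc) replaces the per-row loop entirely:
```
import pyodbc   # hypothetical DSN, tables, and columns

conn = pyodbc.connect("DSN=hr", autocommit=False)
conn.execute("""
    MERGE dbo.user_info AS target
    USING dbo.user_staging AS source
        ON target.login = source.login
    WHEN MATCHED THEN
        UPDATE SET target.email = source.email
    WHEN NOT MATCHED THEN
        INSERT (login, email) VALUES (source.login, source.email);
""")
conn.commit()
```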