How do I research obscure file types? [closed] - file-format

A client has a large document-management system -- millions of TIFFs and PDFs, plus a smaller number of other random files: images and other binaries. I'm converting formats, imprinting notes, reorganizing, and redacting sensitive information when I find it. All of that works fine for the vast bulk of the files.
But occasionally I find a new format and have to figure out what it is and how to handle it within the project's parameters. Usually this isn't too hard, and when it has been, the files were such a small handful that it didn't matter much if I couldn't handle them. Right now, though, I have a larger batch of files that don't appear to have a sophisticated header but all start with "COM1.0" (43 4F 4D 31 2E 30).
So, I'd like help on two levels. What's a good way for me to research this (and other formats I might find in the future -- teach a man to fish, and all) when just Googling around fails me? And if you know what this file type is, I'd be keen to hear about it.

One specialist site is http://www.wotsit.org/ -- there may be a few others. These give details once you can already identify the file format, though.
There are some more tips, and a table of known file signatures, at http://www.garykessler.net/library/file_sigs.html
I did try doing a little searching and didn't turn up anything much, but I didn't try very hard.
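To make that kind of research repeatable, here is a minimal Python sketch that tallies files by their leading bytes ("magic numbers") so unknown formats like the COM1.0 files cluster together; the incoming directory name and the eight-byte prefix length are assumptions:

    # Sketch: group files by their leading bytes so unknown formats
    # cluster together. The incoming/ directory is hypothetical.
    from collections import Counter
    from pathlib import Path

    def magic(path, length=8):
        with open(path, "rb") as f:
            return f.read(length)

    counts = Counter()
    for path in Path("incoming").rglob("*"):
        if path.is_file():
            counts[magic(path)] += 1

    # Print hex signature, raw bytes, and how many files carry it.
    for sig, n in counts.most_common():
        print(f"{sig.hex(' '):<24} {sig!r:<24} x{n}")

On a Unix box, the file utility performs a similar lookup against its own magic database, so running it over a sample of the unknown files is a cheap second opinion before you search a signature site.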

Good luck, but remember that not every file format is documented outside the company that created it, and few companies publish their file formats before they go under.
Depending on how old these files are, the odds of hitting a brick wall are high unless you have a few extra hints to work with (like the name of the program the files are associated with).

Google
If Google fails, it may be a format specific to your customer.

Related

Is there a way to understand what algorithms PostgreSQL uses to implement relational operators without having to read the source code? [closed]

PostgreSQL makes use of intra-operation parallelism, and that is of interest to me (for my undergraduate final-year research project). I would like to know how operations like selection, projection, join, etc. are parallelized, but when I tried to look at the source code, I got extremely overwhelmed. Is there a high-level PostgreSQL "map"?
I tried looking for books that discuss and explore the algorithms and implementations used in PostgreSQL, but unfortunately didn't find any. Feel free to refer me to such a book if you know of one.
If the only option I have is to dig into the source code, how long would it take me to find the information I want? And if any of you have gone through the source code, what advice would you give me?
The nice thing about open source is that there is no clear border between the source code and the documentation, since both are public. As soon as you get deeper into the implementation details, you will start reading the code. Fortunately the PostgreSQL code is well written and quite readable.
The first stop on your way into the source is the README files. These describe implementation principles, algorithms, and coding rules at a higher level. In your case, you should start with src/backend/access/transam/README.parallel.
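If it helps to see how many of those files there are, here is a minimal sketch, assuming the source has been cloned into ./postgres, that lists every subsystem README under src/:

    # List the higher-level README files in a PostgreSQL source checkout,
    # assuming it lives in ./postgres (an assumption of this sketch).
    from pathlib import Path

    for readme in sorted(Path("postgres/src").rglob("README*")):
        print(readme.relative_to("postgres"))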
Another good approach is to read the patches that introduced the feature, like 924bcf4f16d, 7aea8e4f2daa, d1b7c1ffe72, f0661c4e8c44 and 80558c1f5aa1. That introduces you to the places in the code that are concerned with parallel query and gives you an idea of how it all works.

What is STRIPS in artificial intelligence? [closed]

I have a homework assignment in artificial intelligence.
I need to make a robot go from room A to room B, and there are obstacles between the rooms.
The professor asked me to use STRIPS (Stanford Research Institute Problem Solver), but I cannot understand how STRIPS works.
Can someone give me a good explanation and examples of what STRIPS is and how it works?
Thank you.
[Please note, this is based on what I half-remember from nearly a year ago]
These days, I would expect that when the professor says STRIPS, they are talking about the problem-encoding 'language' rather than the planner -- check, for example, the Wikipedia page: STRIPS. I would imagine your professor likely has a particular solver (and quite possibly a particular algorithm) in mind, and wants you to encode the domain and the specific problem to run on that solver. Without knowing more details of the assignment, I can't be sure what you need. If you're looking for a planner, as I understand it, Fast Downward is quite popular among researchers currently. The website has some instructions on how to use it, and IIRC it comes with a bunch of domains and problems for those domains. I would thoroughly recommend looking at those; they're pretty much what I learnt with. I also just found this and this.
STRIPS is essentially a way of encoding information about the problem you want the computer to solve. Typically you encode a domain, which provides information about the problem overall, such as what objects may be involved, what states they can be in, and what actions can be taken. Then you also encode a particular problem, which (generally) specifies the starting state and what the goal state should look like. Both files are fed into a solver, which finds a solution to the problem. Note that this won't necessarily be an optimal solution; that depends on what algorithm you use, and on how you have told the solver what should be optimized (which I think you can generally do in the problem file, though I can't remember for sure now).
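To make that concrete, here is a minimal Python sketch of STRIPS-style semantics applied to the room example; the fact and action names (at, connected, move) are illustrative, not any real planner's input syntax:

    # Minimal STRIPS-style sketch: an action has preconditions, an add list,
    # and a delete list over ground facts.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Action:
        name: str
        preconditions: frozenset
        add_list: frozenset
        delete_list: frozenset

        def applicable(self, state):
            return self.preconditions <= state

        def apply(self, state):
            # STRIPS semantics: remove the delete list, then add the add list.
            return (state - self.delete_list) | self.add_list

    move_a_b = Action(
        name="move(A, B)",
        preconditions=frozenset({"at(robot, A)", "connected(A, B)"}),
        add_list=frozenset({"at(robot, B)"}),
        delete_list=frozenset({"at(robot, A)"}),
    )

    state = frozenset({"at(robot, A)", "connected(A, B)"})
    if move_a_b.applicable(state):
        state = move_a_b.apply(state)
    print(state)  # {'at(robot, B)', 'connected(A, B)'} (set order may vary)

A real assignment would normally encode this in a planner's input language (such as PDDL) and hand it to a solver like Fast Downward, but the add-list/delete-list mechanics are the same.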
I suggest you have a look at those links, and see what you can find out. That should hopefully give you a better idea of what gaps you need filled in your knowledge, and then you can narrow in on exact specifics. If this is a taught course assignment, then I would expect that surely the Prof. would have gone over some of this in lectures (do you have lecture slides available?), or at least pointed everyone towards a recommended planner and material to read up on. If you're still struggling, your best bet is to go back and see the Prof. in office hours.

When developing a database, is it important to keep in mind a future application? [closed]

I am designing a database for the first time outside of the classroom, to support a future Java application with all of its desired functionality. As I design the entity-relationship diagrams and tables, I find myself constantly thinking about the Java project that comes later. I am beginning to wonder whether this is confusing me and making things harder for myself; I am getting nervous that I might not yet be skilled enough to pull this off.
Should I just focus on producing the most normalized database I can and trust that it will allow for my application to do everything it needs to do?
Or,
Should I definitely be keeping my future application in mind with each step of database development to ensure total functionality?
Edit: I would also appreciate any recommendations on free database design tools.
Databases are notoriously hard to refactor, so if you know about something you haven't gotten to yet but are definitely going to do, you need to consider it in your design. This is especially true if the future work (reporting, for example) will need to look at lots of records, or will need point-in-time data rather than values calculated on the fly. Consider the difference between storing the cost of an order versus calculating it from current prices. If you look only at the order process, you might think it is fine to just calculate the price, but reporting needs to know what the price was at the time the order happened, or the financial records will be wrong.
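Here is a minimal sketch of that point-in-time idea, using Python's built-in sqlite3 module with hypothetical table names: the order line copies the unit price at order time, so later price changes don't rewrite history:

    # Sketch of point-in-time pricing; table and column names are hypothetical.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE products (
        id INTEGER PRIMARY KEY, name TEXT, current_price REAL);
    CREATE TABLE order_items (
        id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES products(id),
        quantity INTEGER,
        unit_price REAL  -- copied from products.current_price at order time
    );
    """)
    conn.execute("INSERT INTO products VALUES (1, 'widget', 9.99)")
    conn.execute("INSERT INTO order_items VALUES (1, 1, 2, 9.99)")  # price then
    conn.execute("UPDATE products SET current_price = 12.49 WHERE id = 1")
    # Reporting still sees what the customer actually paid, not today's price.
    paid = conn.execute("SELECT quantity * unit_price FROM order_items").fetchone()[0]
    print(paid)  # 19.98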
You might read this:
What are the general guidelines and best practices to keep in mind while designing database for an application?

How big would a "typical" Salesforce installation/configuration code base be, in lines of code? [closed]

I understand that "lines of code" may not be a fully accurate measure, because a lot of Salesforce configuration is done through a GUI. Nevertheless, for the sake of argument, let's say that each config field manually filled out in the GUI counts as a line of code, or else let's imagine the configuration being done entirely in source code, if the Force.com platform allows it.
So, how big would a typical "professional" or "enterprise" level Salesforce installation's code base be? Is it like 1K lines? 10K lines? Are there many cases of 100K or more out there?
You will hardly get a meaningful answer. Just as an example: one of my clients is an ISP. We have been working on their enterprise instance for 5 years now, and we have a code base of about 600 KB (plus testing code to satisfy the 75% coverage requirement), with more than 100 classes and even more pages, about 20 custom tabs, 30 custom objects, and approximately 150 custom fields on factory objects, all complemented by a ton of administrative customization. In contrast, their sister company, operating the same business in another territory, has its own instance without a single line of code, but they use it just for opportunity tracking and do their provisioning, operations, case management, and other work from different (legacy) systems.
Overall average statistics are something only Salesforce knows, and they don't seem to share that information. Either way, you should get a trial of whatever edition you want, see what is missing in terms of your business requirements, and plan accordingly.

Better approach to filtering Wikipedia edits [closed]

When you are watching for news of a particular Wikipedia article via its RSS channel, it is annoying without filtering the information, because most of the edits are spam, vandalism, minor edits, etc.
My approach is to create filters. I decided to remove all edits that don't carry the contributor's nickname but are identified only by the contributor's IP address, because most such edits are spam (though there are some good contributions). This was easy to do with regular expressions.
I also removed edits that contained vulgarisms and other typical spam keywords.
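For reference, here is a minimal sketch of those two filters in Python; the entry field names and the example spam terms are assumptions about how the feed entries are parsed:

    # Sketch of the IP-address and keyword filters described above.
    import re

    IP_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")
    SPAM_RE = re.compile(r"\b(viagra|casino)\b", re.IGNORECASE)  # example terms

    def keep(entry):
        if IP_RE.match(entry.get("author", "")):      # anonymous, IP-only edit
            return False
        if SPAM_RE.search(entry.get("summary", "")):  # typical spam keywords
            return False
        return True

    edits = [
        {"author": "192.0.2.7", "summary": "added external links"},
        {"author": "SomeUser", "summary": "fixed a typo"},
    ]
    print([e["author"] for e in edits if keep(e)])  # ['SomeUser']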
Do you know a better approach utilizing algorithms or heuristics with regular expressions, AI, text-processing techniques, etc.? The approach should be able to detect bad posts (minor edits or vandalism), incrementally learn what makes a good or bad contribution, and update its database.
Thank you.
There are many different approaches you can take here, but traditionally spam filters with incremental learning have been implemented using naive Bayesian classifiers. Personally, I prefer the even easier to implement Winnow2 algorithm (details can be found in this paper).
First you need to extract features from the text you want to classify. Unfortunately the Wikipedia RSS feeds don't seem to be particularly machine readable, so you probably need to do some preprocessing. Alternatively you could directly use the Mediawiki API or see if one of the bot frameworks linked at the bottom of this page is of help to you.
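As a starting point for that preprocessing, here is a hedged sketch that pulls recent revision metadata for one article through the MediaWiki API; the parameter names come from the public prop=revisions module, but verify them against the current API documentation before relying on them:

    # Hedged sketch: fetch revision metadata for one article (title is an
    # assumption) via the MediaWiki API's prop=revisions module.
    import requests

    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "revisions",
            "titles": "Example",  # article to watch (assumption)
            "rvprop": "ids|flags|user|comment|size",
            "rvlimit": 20,
            "format": "json",
        },
        headers={"User-Agent": "edit-filter-sketch/0.1 (research)"},
        timeout=30,
    )
    for page in resp.json()["query"]["pages"].values():
        for rev in page.get("revisions", []):
            # "minor" and "anon" appear as keys only when those flags are set.
            print(rev.get("user"), "minor" in rev, rev.get("size"), rev.get("comment", ""))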
Ideally you would end up with a list of words that were added, words that were removed, various statistics you can compute from that, and the metadata of the edit. I imagine the list of features would look something like this:
editComment: wordA (wordA appears in edit comment)
-wordB (wordB removed from article)
+wordC (wordC added to article)
numWordsAdded: 17
numWordsRemoved: 22
editIsMinor: Yes
editByAnIP: No
editorUsername: Foo
etc.
Anything you think might be helpful in distinguishing good from bad edits.
Once you have extracted your features, it is fairly simple to use them to train the Winnow/Bayesian classifier.
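As a sketch of how little code that takes, here is a minimal Winnow2 implementation over binary feature sets in the style of the list above; the threshold and promotion factor are tunable assumptions (the threshold is classically tied to the number of features):

    # Minimal Winnow2 sketch: weights start at 1 and are multiplied (promoted)
    # or divided (demoted) by alpha whenever the classifier makes a mistake.
    class Winnow2:
        def __init__(self, alpha=2.0, threshold=4.0):
            self.alpha = alpha
            self.threshold = threshold
            self.weights = {}  # feature name -> weight, lazily initialized to 1

        def score(self, features):
            return sum(self.weights.setdefault(f, 1.0) for f in features)

        def predict(self, features):
            # True means "bad edit" (spam/vandalism) under this convention.
            return self.score(features) >= self.threshold

        def train(self, features, is_bad):
            if self.predict(features) == is_bad:
                return  # no mistake, no update
            factor = self.alpha if is_bad else 1.0 / self.alpha
            for f in features:
                self.weights[f] *= factor

    # Toy training data using feature names like those listed above.
    examples = [
        ({"editByAnIP", "+badword", "editIsMinor"}, True),
        ({"editorUsername:Foo", "+citation", "numWordsAdded:17"}, False),
    ]
    clf = Winnow2()
    for _ in range(10):  # a few passes over the data
        for feats, label in examples:
            clf.train(feats, label)
    print(clf.predict({"editByAnIP", "+badword"}))  # True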
