This question came up today and I couldn't find any historical answer as to why a database is always represented as a cylinder. I am hoping someone here knows why and has a link or source backing it up.
I'm reasonably certain that it predates disk drives, and goes back to a considerably older technology: drum memory:
Another possibility (or maybe the choice was based on both) is a still older technology: mercury tank memory:
You may have seen the symbol oriented horizontally instead of vertically, but horizontal drums were common as well:
You asked for more pics. I took these at the Computer History Museum in Mountain View, CA, in May 2016.
Description for the above image says:
UNIVAC I mercury memory tank, Remington Rand, US, 1951
For memory, the UNIVAC used seven mercury delay line tanks. Eighteen pairs of crystal transducers in each tank transmitted and received data as waves in mercury held at a constant 149°F
Gift of William Agee X976.89
Description for the above image says:
Williams-Kilburn tube - Manchester Mark I, Manchester University, UK, ca 1950
This was the memory in the Manchester Mark I, the successor to the "Baby." It stored only 128 40-bit words. Each bit was an electric charge that created a spot of light on the face of a "TV tube."
Gift of Manchester University Computer Science Department, X67.82
It's because people view a DB as simple storage, much like a disk. And disk storage has always been represented by a cylinder due to, well, the physical properties of spinning magnetic disks.
I always assumed it stood for the round shape of a hard drive platter. The average consumer might not have known what a physical hard drive component looked like, so it was represented as a cylinder.
I have a PDF file that includes a table and I want to convert it into table structured data.
My PDF file includes a pretty complex table, which makes most tools insufficient. For example,
I tried to use the following tools and they didn't extract it well: AWS Textract, Google AI Document, Google Vision, Microsoft Text Recognition.
Actually, Google AI Document managed to get about 70% correct, but it is not good enough.
So I searched for a way to train a custom model, so that when extracting this table, it extracts it properly. I tried Power Apps AI Builder and Google AutoML Entity Extraction, but neither helped (BTW, I wasn't sure what AutoML's purpose is: is it only for prediction, or can it also be customized for table extraction?).
I would like to know which tools are good for my use case and whether there is any (AI) tool that I can train on this kind of table so that the text extraction will be better.
Most text extractors should preserve that structure if it is rendered crisply enough, but layout can be a fickle mistress.
Here it correctly picked up the misspelling of "reaar" but failed on "05.05.1983" in the first line.
On an identical second pass the failures are different:
3 29.06.1983 Part of Ground Floor of 05.05.1983 GM315727
2 (part of) Conavon Court 25 years from
1.3.1983
4 31.01.1984 Part of Third Floor Conavon 30.12.1983 GM335793
4 (part of) Court 25 years from
12.8.1983
5 19.04.1984 I?art of Basement Floor of 23.01.1984 GM342693
l (part of), 2 Conavon C:ourt 25 years from
(part of), 3 20.01.1984
(part Of ) , 4
(part of)
NOTE: The Lease also grants a right of way for the purpose only of
loading and unloading and reserves a right of way in case of emergency
only from the boiler house adjacent hereto
6 14.06.1984 Part of Third Floor Conavon 31.10.1983 GM347623
3 (part of) Court 25 years from
31.10.1983
7 14.06.1984 Part of the Third Floor 31.10.1983 GM347623
3 (part: of}, 4 Conavon Court 25 years from
(part of) 31.10.1983
8 01.10.1984 "The Italian Stallion'' 17.08.1984 GM357142
4 (part of) Conavon Court (Basement) 25 years from
20.1.1984
NOTE: The Lease also grants a right of way for the purpose only of
loading and unloading and a right of access through the security door
at the reaar of the building
9 06.07.2016 3rd floor 14-16 Blackfriars 28.06.2016
4 (part of}, 5 Streec 5 years from
(part of) 25/06/2016
That's the beauty of OCR: every run can have a different per-character pass rate, so experience says to use a best-of-three estimate. Run the extraction three different ways, compare the results character by character, and keep the characters the runs agree on.
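To illustrate that best-of-three idea, here is a minimal sketch in Python (the helper name and the sample strings are made up for illustration). It naively votes per character position, which assumes all three passes have the same length, i.e. substitution errors only; output with inserted or dropped characters would need to be aligned first (e.g. with difflib) before voting.

# Minimal best-of-three sketch: vote per character position across three OCR
# passes and keep whatever character at least two of the passes agree on.
from collections import Counter

def best_of_three(a, b, c):
    result = []
    for chars in zip(a, b, c):
        char, count = Counter(chars).most_common(1)[0]
        # Majority wins; if all three passes disagree, fall back to the first
        # pass (and ideally flag the position for manual review).
        result.append(char if count >= 2 else chars[0])
    return "".join(result)

# Substitution-only example with three equal-length passes:
print(best_of_three("Part of Third Floor",
                    "Pavt of Third Floor",
                    "Part of Th1rd Floor"))   # -> "Part of Third Floor"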
I am working on an algorithm where I can have any number of 16-bit values (for instance, 1000 16-bit values, all sensor data, so no particular series or repetition). I want to stuff all of this data into an 8- or 10-byte array (each and every one of the 1000 16-bit values should be inside the 10-byte array). The information should be encoded so that I can also easily decode and read each and every one of the 1000 values.
I have thought of using the sin function after dividing the values by 100, so every data point would always fit in 8 bits (the 0-1 sin value range), but that only covers a small range of data and not a huge number of values.
Pardon me if I am asking for too much. I am just curious whether it's possible or not.
The answer to this question is rather obvious with a little knowledge of information science: it is not possible to store that much information in so little memory, and the data you are talking about simply contains too much information.
Some data, like repetitive data or data that follows some structure (like constantly rising values), contains very little information. The task of compression algorithms is to figure out that structure or repetition and, instead of storing the raw data, to store the structure or the rule for reproducing it.
In your case, the data is coming from sensors, and unless you are willing to lose a massive amount of information, you will not be able to generate a compressed version of it with a compression factor of the magnitude you are talking about (1000 × 2 bytes into 10 bytes). If your sensors more or less produce the same values all the time with just a little jitter, good compression can be achieved (but for that, your question is way too broad to be answered here), yet it will probably never be in the range of reducing your 1000 values to 10 bytes.
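To make the counting (pigeonhole) argument concrete, here is a tiny Python check using the numbers from the question: a 10-byte array can distinguish only 2^80 different messages, while 1000 arbitrary 16-bit values have 2^16000 possible combinations, so no lossless scheme can exist for arbitrary data.

# Pigeonhole check: there are far more possible inputs than there are
# distinguishable 10-byte outputs, so lossless packing is impossible.
n_values, bits_per_value = 1000, 16
container_bytes = 10

possible_inputs = 2 ** (n_values * bits_per_value)   # 2**16000 distinct inputs
possible_outputs = 2 ** (container_bytes * 8)        # 2**80 distinct 10-byte arrays

print("lossless packing possible for arbitrary data:",
      possible_outputs >= possible_inputs)           # prints False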
I'm implementing a system that detects human emotion in text. Are there any manually annotated data sets available for supervised learning and testing?
Here are some interesting datasets:
https://dataturks.com/projects/trending
The field of textual emotion detection is still very new and the literature is fragmented across many different journals in different fields. It's really hard to get a good overview of what's out there.
Note that there are several emotion theories in psychology, and hence different ways of modeling/representing emotions in computing. Most of the time, "emotion" refers to phenomena such as anger, fear or joy. Other theories state that all emotions can be represented in a multi-dimensional space (so there is an infinite number of them).
Here are some (publicly available) data sets I know of (updated); a minimal baseline sketch follows the list:
EmoBank. 10k sentences annotated with Valence, Arousal and Dominance values (disclosure: I am one of the authors). https://github.com/JULIELab/EmoBank
The "Emotion Intensity in Tweets" data set from the WASSA 2017 shared task. http://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html
The Valence and Arousal Facebook Posts data set by Preotiuc-Pietro and others:
http://wwbp.org/downloads/public_data/dataset-fb-valence-arousal-anon.csv
The Affect data by Cecilia Ovesdotter Alm:
http://people.rc.rit.edu/~coagla/affectdata/index.html
The Emotion in Text data set by CrowdFlower
https://www.crowdflower.com/wp-content/uploads/2016/07/text_emotion.csv
ISEAR:
http://emotion-research.net/toolbox/toolboxdatabase.2006-10-13.2581092615
Test Corpus of SemEval 2007 (Task on Affective Text)
http://web.eecs.umich.edu/~mihalcea/downloads.html
A reannotation of the SemEval Stance data with emotions:
http://www.ims.uni-stuttgart.de/data/ssec
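Once you have picked one of the data sets above, a simple supervised baseline is straightforward to set up. The sketch below is only illustrative: it assumes you have exported the data to a local CSV with a text column and a categorical emotion label (the file name and the column names "text" and "emotion" are assumptions, adjust them to whichever data set you download), and it uses a plain TF-IDF plus logistic regression pipeline from scikit-learn as a starting point.

# Hypothetical baseline: TF-IDF features + logistic regression on a CSV export
# of one of the emotion data sets listed above.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("emotion_dataset.csv")               # hypothetical local copy
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["emotion"], test_size=0.2, random_state=0)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))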
If you want to go deeper into the topic, here are some surveys I recommend (disclosure: I authored the first one).
Buechel, S., & Hahn, U. (2016). Emotion Analysis as a Regression Problem — Dimensional Models and Their Implications on Emotion Representation and Metrical Evaluation. In ECAI 2016, 22nd European Conference on Artificial Intelligence (pp. 1114–1122). The Hague, Netherlands (available: http://ebooks.iospress.nl/volumearticle/44864).
Canales, L., & Martínez-Barco, P. (n.d.). Emotion Detection from text: A Survey. Proceedings of the 5th Information Systems Research Working Days (JISIC 2014), 37 (available: http://www.aclweb.org/anthology/W14-6905).
Update: this has been reported to Microsoft.
In a simple (SQL Server 2012) table with a geography column (named geopoint) populated with a few simple point rows similar to POINT (-0.120875610750927 54.1165118880234), executing
select [geopoint].[STAsText](),
[geopoint].Lat lat,
[geopoint].Long long
from mytable
produces this
Untitled1 lat long
POINT (-0.120875610750927 54.1165118880234) 54.1165118880234 -0.120875610750927
which looks like a bug, but something this basic should have been caught before release. So am I doing something wrong?
Added info
IT professionals should look for the details of Microsoft's implementation of SQL Server on MSDN, as there can be differences between implementations, as in this case. As proof of this I just checked PostGIS's implementation of ST_AsText for a geographic column. It works fine, and the result is as one would expect. Therefore the bug is in SQL Server's implementation. The correct result for the above example should be
POINT (54.1165118880234 -0.120875610750927 ) 54.1165118880234 -0.120875610750927
Dare I say there is a high likelihood that there are other bugs associated with functions working on geography columns, as basic functionality in this area has not been fully tested.
This is working as intended.
According to your question, you stored the data in this pattern:
POINT (-0.120875610750927 54.1165118880234)
then you claimed that the lat/long is reversed according to the MSDN documentation of
Point(Lat, Long, SRID).
You may realize that the syntax you're using is not the same as the one you claim:
POINT(aValue anotherValue) vs Point(Lat, Long, SRID)
Now, the question is, what does MS SQL do to the data?
Turns out, MS SQL interprets the data as Open Geospatial Consortium (OGC) Well-Known Text (WKT), and thus uses the STPointFromText function, since that format is the most suitable one for a 2-D point:
POINT(x y)
Now, the follow-up question, does it mean POINT(Lat Long)?
From the sample code
DECLARE @g geography;
SET @g = geography::STPointFromText('POINT(-122.34900 47.65100)', 4326);
it should be clear that the first coordinate is not the latitude but the longitude (latitude only ranges from -90 to 90), so now we guess that the format is POINT(Long Lat). But why?
As explained in this article,
As you can see [...], the longitude is specified first before the latitude. The reason is because in the Open Geospatial Consortium (OGC) Well-Known Text (WKT) representation, the format is (x, y). Geographic coordinates are usually specified by Lat/Long but between these two, the X is the Longitude while the Y is the Latitude.
You may be wondering why the X-coordinate is the Longitude while the Y-coordinate is the Latitude. Think of the equator of the earth as the x-axis while the prime meridian is the Y-axis. Longitude is defined as the distance from the prime meridian along the x-axis (or the equator). Similarly, latitude is defined as the distance from the equator along the Y-axis.
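Outside SQL Server the same convention holds: any WKT-aware library treats the first coordinate as x (longitude) and the second as y (latitude). As a quick side illustration, not part of the quoted article, parsing the same sample point with the Python shapely library gives:

# Parse the same WKT point with shapely (assumed installed): x is the
# longitude, y is the latitude.
from shapely import wkt

p = wkt.loads("POINT (-122.34900 47.65100)")
print(p.x, p.y)   # -122.349 47.651  -> (longitude, latitude)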
This is a bug. The value returned by STAsText for a geography column swaps the Lat and Long values. Definitely a bug that people should be aware of.
I am working on a project as part of my class curriculum. It's a project for Advanced Database Management Systems, and it goes like this:
1) Download a large number of images (1,000,000) --> Done
2) Cluster them according to their visual similarity
a) Find the histogram of each image --> Done
b) Now group (cluster) the images according to their visual similarity.
Now, I am having a problem with part 2b. Here is what I did:
A) I found the histogram of each image using MATLAB and represented it as a 1D vector (16 x 16 x 16). There are 4096 values in a single vector.
B) I generated an ARFF file with the following format: there are 1,000,000 histograms (one for each image, thus 1,000,000 rows in the file) and 4097 values in each row (image_name + 4096 double values representing the histogram).
C) The file size is 34 GB. THE BIG QUESTION: How the heck do I cluster this file?
I tried using Weka and other online tools, but they all hang. Weka gets stuck and says "Reading a file".
I have 8 GB of RAM on my desktop, and I don't have access to any compute cluster. I tried googling but couldn't find anything helpful about clustering large data sets. How do I cluster these entries?
This is what I thought:
Approach One:
Should I do it in batches of 50,000 or something? Like, cluster the first 50,000 entries and find as many clusters as possible, calling them k1, k2, k3, ..., kn.
Then pick the next 50,000 and allot each of them to one of these clusters, and so on? Will this be an accurate representation of all the images? Because the clustering is done only on the basis of the first 50,000 images!
Approach Two:
Do the above process using 50,000 random entries?
Anyone have any input?
Thanks!
EDIT 1:
Any clustering algorithm can be used.
Weka isn't your best tool for this. I found ELKI to be much more powerful (and faster) when it comes to clustering. The largest data set I've run is ~3 million objects in 128 dimensions.
However, note that at this size and dimensionality, your main concern should be result quality.
If you run e.g. k-means, the result will essentially be random, because you are using 4096 histogram bins (way too many, in particular with squared Euclidean distance).
To get good results, you need to step back and think some more.
What makes two images similar? How can you measure similarity? Verify your similarity measure first.
Which algorithm can use this notion of similarity? Verify the algorithm on a small data set first.
How can the algorithm be scaled up using indexing or parallelism?
In my experience, color histograms worked best in the range of 8 bins for hue x 3 bins for saturation x 3 bins for brightness. Beyond that, the binning is too fine-grained, and it destroys your similarity measure.
If you run k-means, you gain absolutely nothing by adding more data. It searches for statistical means and adding more data won't find a different mean, but just some more digits of precision. So you may just as well use a sample of just 10k or 100k pictures, and you will get virtually the same results.
Running it several times on independent sets of pictures results in different clusters which are difficult to merge, so two similar images may end up in different clusters. I would run the clustering algorithm on one random set of images (as large as possible) and use the resulting cluster definitions to sort all other images, as sketched below.
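As a rough illustration of that workflow, here is a sketch in Python (OpenCV + scikit-learn) rather than the original MATLAB/Weka pipeline: compute coarse 8 x 3 x 3 HSV histograms (72 dimensions instead of 4096), cluster a random sample of images with mini-batch k-means, and then assign every remaining image to its nearest cluster centre. The folder path, sample size and number of clusters are placeholders, not recommendations.

# Coarse HSV histograms + k-means on a sample, then nearest-centre assignment
# for everything else (sketch; paths, k and sample size are placeholders).
import glob
import random
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def coarse_hsv_histogram(path):
    # 8 hue x 3 saturation x 3 value bins, L1-normalised, flattened to 72 dims.
    img = cv2.imread(path)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, [8, 3, 3],
                        [0, 180, 0, 256, 0, 256]).flatten()
    return hist / hist.sum()

paths = glob.glob("images/*.jpg")
sample = random.sample(paths, min(50000, len(paths)))

# Cluster only the sample ...
features = np.array([coarse_hsv_histogram(p) for p in sample])
kmeans = MiniBatchKMeans(n_clusters=50, random_state=0).fit(features)

# ... then sort every other image into the clusters found on the sample.
labels = {p: int(kmeans.predict(coarse_hsv_histogram(p).reshape(1, -1))[0])
          for p in paths}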
Alternative: reduce the complexity of your data, e.g. to a histogram of 1024 double values.