Sorting an array of URLs - arrays

I have an array with quasar URLs stored in it
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/26/spectra/0269/spec-0269-51581-0467.fits
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/26/spectra/0329/spec-0329-52056-0059.fits
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/104/spectra/2957/spec-2957-54807-0164.fits
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/26/spectra/0342/spec-0342-51691-0089.fits
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/26/spectra/2881/spec-2881-54502-0508.fits
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/26/spectra/0302/spec-0302-51616-0435.fits
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/26/spectra/2947/spec-2947-54533-0371.fits
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/26/spectra/0301/spec-0301-51942-0460.fits
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/104/spectra/2962/spec-2962-54774-0461.fits
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/26/spectra/2974/spec-2974-54592-0185.fits
I want to sort out the URL array on basis of the number next to spec- and not using alphabetic order. I sorted the array with sort but it didn't help as it always took the 3rd row and 2nd last row to the top because they have a 1.
I'd like to have an output like this
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/26/spectra/0269/spec-0269-51581-0467.fits
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/26/spectra/0301/spec-0301-51942-0460.fits
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/26/spectra/0302/spec-0302-51616-0435.fits
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/26/spectra/0329/spec-0329-52056-0059.fits
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/26/spectra/0342/spec-0342-51691-0089.fits
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/26/spectra/2881/spec-2881-54502-0508.fits
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/26/spectra/2947/spec-2947-54533-0371.fits
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/104/spectra/2957/spec-2957-54807-0164.fits
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/104/spectra/2962/spec-2962-54774-0461.fits
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/26/spectra/2974/spec-2974-54592-0185.fits

If you will always have this pattern, you can try:
fileName = strsplit(myUrl, '/')(end)
number = strsplit(fileName(5:end), '.')(0)
Gonna walk you through this cause understanding is everything...
We start with
http://dr12.sdss3.org/sas/dr12/sdss/spectro/redux/26/spectra/0269/spec-0269-51581-0467.fits
First we split the URL on the / characters. This will return a vector of strings split up from this character. Since the number to sort on resides after the final /, we can pass end to grab the last one. Now we have
spec-0269-51581-0467.fits
Next, let's remove that pesky spec- from the number. This step isn't actually necessary, since it's constant across all the URLs, but let's just do it for fun. We can use Matlab's substring to grab the characters after the -, using fileName(5:end). This will create a string starting with the 5th character (in this case, a 0) and continue to the end. Great, now we have
0269-51581-0467.fits
Looking good! Again, this part isn't completely necessary either, but just in case for whatever reason you may need to, I've included it. We can use the strsplit function again, but this time split on the ., and grab the first element by passing a 0. Now, we have
0269-51581-0467
Go ahead and sort that little guy and you're good to go!

Related

fast indexing for slow cpu?

I have a large document that I want to build an index of for word searching. (I hear this type of array is really called a concordances). Currently it takes about 10 minutes. Is there a fast way to do it? Currently I iterate through each paragraph and if I find a word I have not encountered before, I add it too my word array, along with the paragraph number in a subsidiary array, any time I encounter that same word again, I add the paragraph number to the index. :
associativeArray={chocolate:[10,30,35,200,50001],parsnips:[5,500,100403]}
This takes forever, well, 5 minutes or so. I tried converting this array to a string, but it is so large it won't work to include in a program file, even after removing stop words, and would take a while to convert back to an array anyway.
Is there a faster way to build a text index other than linear brute force? I'm not looking for a product that will do the index for me, just the fastest known algorithm. The index should be accurate, not fuzzy, and there will be no need for partial searches.
I think the best idea is to build a trie, adding a word at the time of your text, and having for each leaf a List of location you can find that word.
This would not only save you some space, since storing word with similar prefixes will require way less space, but the search will be faster too. Search time is O(M) where M is the maximum string length, and insert time is O(n) where n is the length of the key you are inserting.
Since the obvious alternative is an hash table, here you can find some more comparison between the two.
I would use a HashMap<String, List<Occurrency>> This way you can check if a word is already in yoz index in about O(1).
At the end, when you have all word collected and want to search them very often, you might try to find a hash-function that has no or nearly no collisions. This way you can guarantee O(1) time for the search (or nearly O(1) if you have still some collisions).
Well, apart from going along with MrSmith42's suggestion of using the built in HashMap, I also wonder how much time you are spending tracking the paragraph number?
Would it be faster to change things to track line numbers instead? (Especially if you are reading the input line-by-line).
There are a few things unclear in your question, like what do you mean in "I tried converting this array to a string, but it is so large it won't work to include in a program file, even after removing stop words, and would take a while to convert back to an array anyway."?! What array, is your input in form of array of paragraphs or do you mean the concordance entries per word, or what.
It is also unclear why your program is so slow, probably there is something inefficient there - i suspect is you check "if I find a word I have not encountered before" - i presume you look up the word in the dictionary and then iterate through the array of occurrences to see if paragraph# is there? That's slow linear search, you will be better served to use a set there (think hash/dictionary where you care only about the keys), kind of
concord = {
'chocolate': {10:1, 30:1, 35:1, 200:1, 50001:1},
'parsnips': {5:1, 500:1, 100403:1}
}
and your check then becomes if paraNum in concord[word]: ... instead of a loop or binary search.
PS. actually assuming you are keeping list of occurrences in array AND scanning the text from 1st to last paragraph, that means arrays will form sorted, so you only need to check the very last element there if word in concord and paraNum == concord[word][-1]:. (Examples are in pseudocode/python but you can translate to your language)

C - Arrays/Struct discussion

I have to organize a struct or an array of some people's first name, last name, and age then organize them in alphabetical order writing them from an input file to an output file using the string library.
This is partically for a Lab that I have tomorrow afternoon in which my TA said we might be asked to complete the task I stated above. I'm trying to get opinions or suggestions from individuals whom are much more experienced in C than myself so I'll be more prepared than I was last lab where I didn't so hot.
I'm stuck, any suggestions on where I should start?
Well, reading and writing the input and output should be a no brainer (you open the file, read/write and then close it when you're done with it).
The trick is to get those pesky strings sorted.
The way I look at strnig sorting is an array of arrays sorting. The first array-level is an iteration over all the structs while the second is an iteration over all the letters of a name (note that some names are longer then others)
1. First you sort the first column - all the first letters
2. Then all the sub-columns - within each first letter group you sort the second letter
3. Repeat 1&2 until you run out of letters to sort.
The sorting of an array of letters is the same as sorting an array of byte (AKA chars), and if you're working with an algorithm like bubble sort, working on a sub array is seamless.
Hope this idea helps
You can use strcmp from string.h library, which compares two strings and returns the result.
You should read every string from input file and keep in memory. You will probably use a list for that, for example linked list. While adding you can organize them in alphabetical order. Here is an example:
Take the an element in list.
Compare it with new string.
Where should it be placed? if it should be before the current element, place it before the current element. If it should be after current element, do the same for next element.
This is a simple algorithm for your problem.
The tricky part is to implement the comparation of two strings. the strcmp function only tells you whether the strings are identical. You can loop through the char array to do it.
Once this is done, the remaining sorting is simple.

What is the best way to count the number of times a value occurs in an array?

I have been given an assignment to write a program that reads in a number of assignment marks from a text file into an array, and then counts how many marks there are within particular brackets, i.e. 40-49, 50-59 etc. A value of -1 in the text file means that the assignment was not handed in, and a value of 0 means that the assignment was so bad that it was ungraded.
I could do this easily using a couple of for loops, and then using if statements to check the values whilst incrementing appropriate integers to count the number of occurences, but in order to get higher marks I need to implement the program in a "better" way. What would be a better, more efficient way to do this? I'm not looking for code right now, just simply "This is what you should do". I've tried to think of different ways to do it, but none of them seem to be better, and I feel as if I'm just trying to make it complicated for the sake of it.
I tried using the 2D array that the values are stored in as a parameter of a function, and then using the function to print out the number of occurences of the particular values, but I couldn't get this to compile as my syntax was wrong for using a 2D array as a parameter, and I'm not too sure about how to do this.
Any help would be appreciated, thanks.
Why do you need a couple for loops? one is enough.
Create an array of size 10 where array[0] is marks between 0-9, array[1] is marks between 10-19, etc. When you see a number, put it in the appropriate array bucket using integer division, e.g. array[(int)mark/10]++. when you finish the array will contain the count of the number of marks in each bucket.
As food for thought, if this is a school assignment, you might want to apply other things you have learned in the course.
Did you learn sorting yet? Maybe you could sort the list first so that you are not iterating over the array several times. You can just go over it once, grab all the -1's, and spit out how many you have, then grab all the ones in the next bracket and so on.
edit: that is of course, assuming that you are using a 1d array.

Finding a match in an array using strcmp

im trying to compare words of an array using strcmp.Im trying to get each word that appears more than once in the array to print out only once, so i can determine the number of unique words.I know what its doing wrong as when it searches the array it prints out each copy it finds, for example if the word "the" is in the array 4 times, it will print out 'the' 3 times and when string1 goes to the next location where 'the' is, it will print out 2 times and so on.
Convert your char arrays to std::string and instead of printing them, put them into an std::set. Then print each element in the set.
good that you added declarations,
now from what it looks it seems as if words[][] is redundant and makes things unnecessary complicated. if you are only interested in getting unique words, instead just process what comes back from strtrok by building up a dictionary with the encountered words
a dictionary could be something as simple as a max sized array containing unique words, and an index that starts at 0 when array is empty, whenever strtok returns a word, go through the array and look for the word using your strcmp, if it doesn't exist add it at the end of the array then increment then index.
and bob is your uncle :)

What is the best data structure and algorithm for comparing a list of strings?

I want to find the longest possible sequence of words that match the following rules:
Each word can be used at most once
All words are Strings
Two strings sa and sb can be concatenated if the LAST two characters of sa matches the first two characters of sb.
In the case of concatenation, it is performed by overlapping those characters. For example:
sa = "torino"
sb = "novara"
sa concat sb = "torinovara"
For example, I have the following input file, "input.txt":
novara
torino
vercelli
ravenna
napoli
liverno
messania
noviligure
roma
And, the output of the above file according to the above rules should be:
torino
novara
ravenna
napoli
livorno
noviligure
since the longest possible concatenation is:
torinovaravennapolivornovilligure
Can anyone please help me out with this? What would be the best data structure for this?
This can be presented as a directed graph problem -- the nodes are words, and they are connected by an edge if they overlap (and the smallest overlap is chosen to get the longest length), and then find the highest weight non-intersecting path.
(Well, actually, you want to expand the graph a bit to handle beginning and ending at a word. Adjoin a "starting node" with with an edge to each word of weight length word / 2.
Between words, -overlap + length start + length finish / 2, and between each word and an "ending node" "length word / 2". Might be easier to double it.)
https://cstheory.stackexchange.com/questions/3684/max-non-overlapping-path-in-weighted-graph
I would start really simple. Make 2 vectors of strings, one sorted normally, one sorted by the last 2 letters. Create an index (vector of ints) for the second vector that points out it's position in the first.
To find the longest.. first remove the orphans. words that don't match at either end to anything. Then you want to build a neighbor joining tree, this is where you determine which words can possibly reach each other. If you have 2 or more trees you should try the largest tree first.
Now with a tree your job is to find ends that are rare, and bind them to other ends, and repeat. This should get you a pretty nice solution, if it uses all the words your golden, skip the other trees. If it doesn't then your into a whole slew of algorithms to make this efficient.
Some items to consider:
If you have 3+ unique ends you are guaranteed to drop 1+ words.
This can be used to prune your tries down while hunting for an answer.
recalculate unique ends often.
Odd numbers of a given end ensure that one must be dropped(you get 2 freebies at the ends).
Segregate words that can self match , you can always throw them in last, and they muck up the math otherwise.
You may be able to create small self matching rings, you can treat these like self matching words, as long as you don't orphan them when you create them. This can make the performance fantastic, but no guarantees on a perfect solution.
The search space is order(N!) a list of millions of elements may be hard to prove an exact answer. Of course I could be overlooking something.

Resources