Last post, I mentioned something about vectorising data.

What does this mean, and why do it?

A vector is a measurement with both a magnitude and a direction. Sometimes the directions are described in terms of a bearing, but it's also conventional to determine a vector as a tuple of values in a series of dimensions, each of which is (normally) considered as being at right-angles to one another in a coordinate system.

The nice thing about vectors in an n-dimensional coordinate system is that they can be manipulated and worked with using linear algebra, enabling the computerisation of functions such as finding nearest matches, categorisation, clustering, network analysis, search algorithms and other useful techniques.

Let's take an example - say I've got the latitude and longitude coordinates of a group of people who are out with their mobile phones. I could calculate their relative distances and use this to pair them off with their nearest counterparts (subject to there being an even number of them, and ignoring tricky 3-way type situations for now). That's basic employment of Pythagoras' theorem.

$$h^2=a^2+b^2$$

And then finding the minimum value for each pair.

Using slightly more advanced mathematics, we could group these pairs into clusters, or find the line that best splits them into two, three, or more groups - it's all just an extension of Pythagoras (ok, sorry Linear Algebra folks, I know, that is an over-simplification)

Now consider someone typing in "Piefegoras" into a Google-search. How might Google be able to get from that munged input request yet still relay the results for Pythagoras?

One way might be to "vectorise" the english language. Take each of the 1,000,000 or so English words, and break them down into a 26-dimensional vector space where each dimension represents the number of letters corresponding to that dimension. So the word "aardvark" for example might be represented as the vector (3,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,1,0,0,0,0) - and the word "vector" would be represented as the vector (0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,1,0,0,0,0) . Now in both these examples, there are 26 dimensions, so we might have to tweak Pythagoras a little bit - but it turns out that the distance between the two can be determined using: $$h^2 = a^2+b^2+c^2+d^2+e^2+f^2+ ... +w^2+x^2+y^2+z^2$$

Where each of (a, b, c,...,x, y, z) is the difference between that letter's vector components. So for example, the a value in the above formula would be 3 since there are 3 a's in "aardvark" and none in "vector". So already, we can map out all the words in the dictionary, and indeed all the words not in the dictionary (I looked up "Piefegoras" and it's not there...yet) that use the 26-characters we have in the English alphabet.

Once we can build a map of this vector space, we can do all the same things we did with the people and their 2d latitude and longitude values - cluster words that consist of similar letter counts - indeed, we can use this to classify all words that share the same numbers of letters, i.e. anagrams, or that are 1 letter different, or 2 letters different, and so on - basically, half the puzzles page at the back of the newspaper becomes trivial from now on.

But not only this, we now have a viable spell-checker, or as in the example, a "Did you mean.."-er that will take garbled input and be able to find a nearest neighbour and offer that as a suggestion.

This is a simplification, our initial set of 26 dimensions is ok, but doesn't make the distinction between say, "hatred" from "thread", or "melons" from "lemons". So we need to include an additional set of vectors that are sensitive to letter placement and order, such a set might be those vectors that correspond to each letter pair, such as ("aa", "ab", "ac", ..., "zx", "zy", "zz") this is an impressively large vector-set, consisting of an additional 676 dimensions to a vector space consisting of a total of 702 dimensions. This makes the maths a little more unwieldy, and the storage requirements higher, but hey, it's 2016 and we've got computers. And why stop there, it may be advantageous to include 3-tuples of letters in the set ("aaa", "aab", "aac", ... " "zzx", "zzy", "zzz") or 4-tuples, or 5-tuples and so on - each additional n-tuple exponentially increasing the size of the dimensions under review - so it makes sense to find a sensible balance-point where the additional expense doesn't provide enough utility to warrant the additional complexity. My guess this is around 3-tuples, but 2 is probably enough for most jobs.

So with this improved vector scheme we have, in purely algorithmic terms, a really simple way of breaking down any series of alphabetic characters and comparing its "distance" from any other series of alphabetic characters. The concept of "distance" here is actually very close to our human concept of "similarity" at least in terms of the alphabetic building blocks of two words and that's actually quite powerful.

What's more, by simply extending the input conditions we can include numerics, "foreign" characters, shapes, even pictograms - and still be able to identify closeness. It would be interesting to put something like this together, and see if, using clustering and a bit of human guidance (though it'd be interesting to try otherwise) such a system might be able to help classify English words in terms of their origin. English words from French for example, might share a particular set of features like the inclusion of perhaps ("-re", "-aux", "-ois") perhaps, while Germanic or old English words might contain ( "-ang", "-tz", "-gh" ) or similar - I must confess, I'm no linguist. But, the concept ought to yield if nothing else, a series of cluster-words, the provenance of which, I'd wager, might be shared.

Another fun thing to consider is that we might be able to establish theoretical words - imagine a feature space consisting of all 1-tuple, 2-tuple, 3, 4, 5 and 6-tuple dimensions. If we then populated that space will all the words in the dictionary, from a text, or traweled from the internet, we'd be able to find the most "average" word by picking the one that was most central. Or, more spookily, calculating the most central location in the feature-space, and re-constructing the word that might appear there - despite it never have existed in any dictionary, been spelt out in a forum or commited to paper, print or screen. Performing this reconstructive process - essentially performing the transformation backwards from a location in the feature space is tricky, but if the space were constructed formally enough, this would be possible.

The downsides of this approach, especially if you want to build in the reversibility feature is that you need a massively wide set of dimensions. There are some clever ways to keep these down to a minimum that I might go into later - and there are also some problems inherent in this approach when you're trying to calculate distances between more than say 100 vectors at a time - but we can talk about these later.

To summarise then, this is a technique for converting a one-dimensional string of objects each of which belongs to a finite set (i.e. an alphabet) into an n-dimensional feature-space, where n is the size of the alphabet plus n² for 2-tuples, plus n³ for 3-tuples and so on, for the number of x-tuples to be included to take care of the anagram problem, the purpose of which is to "vectorise" such one-dimensional strings in order to perform "similarity" functions upon them.

It's worth noting here that these techniques are not new, nor particularly original - indeed they're well known and widely publicised. I think they're interesting though, and hope the reader finds something of interest in my discussing them.

I'm working on some code examples at the moment which I'll share along with results once I'm happy.