Friday, 29 January 2016

Some thoughts on Vectorisation

Last post, I mentioned something about vectorising data.

What does this mean, and why do it?

A vector is a measurement with both a magnitude and a direction. Sometimes the direction is described in terms of a bearing, but it's also conventional to represent a vector as a tuple of values across a series of dimensions, each of which is (normally) at right angles to the others in a coordinate system.

The nice thing about vectors in an n-dimensional coordinate system is that they can be manipulated and worked with using linear algebra, which lets a computer carry out functions such as finding nearest matches, categorisation, clustering, network analysis, search, and other useful techniques.

Let's take an example - say I've got the latitude and longitude coordinates of a group of people who are out with their mobile phones. I could calculate their relative distances and use these to pair each person off with their nearest counterpart (subject to there being an even number of them, and ignoring tricky 3-way situations for now). That's a basic application of Pythagoras' theorem.

$$h^2=a^2+b^2$$

And then finding the minimum of those values to decide who pairs with whom.
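Here's a rough Python sketch of that pairing idea - the names and coordinates are made up, and treating latitude/longitude as a flat plane is only reasonable over short distances:

```python
from itertools import combinations
from math import hypot

# A hypothetical group of people and their (latitude, longitude) positions.
people = {
    "alice": (51.501, -0.142),
    "bob":   (51.503, -0.119),
    "carol": (51.508, -0.128),
    "dave":  (51.500, -0.125),
}

# Distance between every possible pair, via h^2 = a^2 + b^2.
distances = {
    (p, q): hypot(x1 - x2, y1 - y2)
    for (p, (x1, y1)), (q, (x2, y2)) in combinations(people.items(), 2)
}

# Greedily pair off the closest people who haven't been matched yet.
matched = set()
for (p, q), d in sorted(distances.items(), key=lambda item: item[1]):
    if p not in matched and q not in matched:
        print(f"{p} <-> {q} ({d:.4f} degrees apart)")
        matched.update((p, q))
```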

Using slightly more advanced mathematics, we could group these pairs into clusters, or find the line that best splits them into two, three, or more groups - it's all just an extension of Pythagoras (ok, sorry Linear Algebra folks, I know, that is an over-simplification).

Now consider someone typing "Piefegoras" into a Google search. How might Google take that munged input and still return the results for Pythagoras?

One way might be to "vectorise" the English language. Take each of the 1,000,000 or so English words, and break them down into a 26-dimensional vector space where each dimension holds the count of one letter. So the word "aardvark", for example, might be represented as the vector (3,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,1,0,0,0,0) - and the word "vector" would be represented as the vector (0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,1,0,0,0,0). Now in both these examples there are 26 dimensions, so we have to extend Pythagoras a little - but it turns out that the distance between the two can be determined using: $$h^2 = a^2+b^2+c^2+d^2+e^2+f^2+ \dots +w^2+x^2+y^2+z^2$$

Where each of (a, b, c, ..., x, y, z) is the difference between the two words' counts for that letter. So, for example, the a value in the formula above would be 3, since there are three a's in "aardvark" and none in "vector". So already we can map out all the words in the dictionary, and indeed all the words not in the dictionary (I looked up "Piefegoras" and it's not there... yet) that use the 26 characters of the English alphabet.
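A minimal Python sketch of that letter-count vectorisation (the function names are my own):

```python
from collections import Counter
from math import sqrt
from string import ascii_lowercase

def letter_vector(word):
    """26-dimensional vector of letter counts, one dimension per letter a-z."""
    counts = Counter(word.lower())
    return [counts.get(letter, 0) for letter in ascii_lowercase]

def distance(word_a, word_b):
    """Euclidean distance between two words' letter-count vectors."""
    return sqrt(sum((a - b) ** 2
                    for a, b in zip(letter_vector(word_a), letter_vector(word_b))))

print(letter_vector("aardvark"))        # [3, 0, 0, 1, 0, ..., 2, ..., 1, ...]
print(distance("aardvark", "vector"))   # 4.0
```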

Once we've built a map of this vector space, we can do all the same things we did with the people and their 2D latitude and longitude values - cluster words with similar letter counts, for instance. Indeed, we can use this to classify all words that share exactly the same letter counts, i.e. anagrams, or that are one letter different, or two letters different, and so on - basically, half the puzzles page at the back of the newspaper becomes trivial from now on.
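For instance, grouping words by identical letter-count vectors picks out the anagrams (the word list here is just illustrative):

```python
from collections import Counter
from string import ascii_lowercase

def letter_vector(word):
    counts = Counter(word.lower())
    return tuple(counts.get(letter, 0) for letter in ascii_lowercase)

# A toy word list - a real dictionary file would be the obvious substitute.
words = ["hatred", "thread", "melons", "lemons", "solemn", "vector"]

# Words with identical letter-count vectors are anagrams of one another.
anagrams = {}
for word in words:
    anagrams.setdefault(letter_vector(word), []).append(word)

for group in anagrams.values():
    if len(group) > 1:
        print(group)   # ['hatred', 'thread'] then ['melons', 'lemons', 'solemn']
```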

But not only this - we now have a viable spell-checker, or, as in the example, a "Did you mean..."-er that will take garbled input, find a nearest neighbour, and offer that as a suggestion.
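A toy "Did you mean..."-er might look like the sketch below - the mini-dictionary is made up, and a real system would need something less crude than raw letter counts, but it shows the nearest-neighbour idea:

```python
from collections import Counter
from math import sqrt
from string import ascii_lowercase

# Same helpers as in the earlier sketch.
def letter_vector(word):
    counts = Counter(word.lower())
    return [counts.get(letter, 0) for letter in ascii_lowercase]

def distance(word_a, word_b):
    return sqrt(sum((a - b) ** 2
                    for a, b in zip(letter_vector(word_a), letter_vector(word_b))))

# A made-up mini-dictionary, for illustration only.
dictionary = ["pythagoras", "euclid", "plato", "newton", "einstein", "algorithm"]

def did_you_mean(garbled):
    """Suggest the dictionary word whose letter-count vector is nearest."""
    return min(dictionary, key=lambda word: distance(garbled, word))

print(did_you_mean("piefegoras"))   # -> 'pythagoras'
```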

This is a simplification, of course: our initial set of 26 dimensions is ok, but it can't distinguish, say, "hatred" from "thread", or "melons" from "lemons". So we need to include an additional set of dimensions that are sensitive to letter placement and order. One such set is the vectors corresponding to each letter pair ("aa", "ab", "ac", ..., "zx", "zy", "zz") - an impressively large set, adding another 676 dimensions for a vector space totalling 702 dimensions. This makes the maths a little more unwieldy, and the storage requirements higher, but hey, it's 2016 and we've got computers. And why stop there? It may be advantageous to include 3-tuples of letters ("aaa", "aab", "aac", ..., "zzx", "zzy", "zzz"), or 4-tuples, or 5-tuples and so on - each extra tuple length multiplying the number of new dimensions by 26, so the space grows exponentially - which means it makes sense to find a balance point beyond which the additional expense doesn't provide enough utility to warrant the added complexity. My guess is that this lies around 3-tuples, but 2 is probably enough for most jobs.
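A rough sketch of that extension, sticking to 1-tuples and 2-tuples (the 702 dimensions mentioned above) - enough to tell "melons" from "lemons":

```python
from collections import Counter
from itertools import product
from math import sqrt
from string import ascii_lowercase

# Every 1-tuple and 2-tuple of letters: 26 + 676 = 702 dimensions.
DIMS = ["".join(t) for n in (1, 2) for t in product(ascii_lowercase, repeat=n)]

def ngram_vector(word):
    """Counts of each single letter and each adjacent letter pair in the word."""
    word = word.lower()
    counts = Counter()
    for n in (1, 2):
        counts.update(word[i:i + n] for i in range(len(word) - n + 1))
    return [counts.get(dim, 0) for dim in DIMS]

def distance(word_a, word_b):
    return sqrt(sum((a - b) ** 2
                    for a, b in zip(ngram_vector(word_a), ngram_vector(word_b))))

print(distance("melons", "lemons"))   # ~2.449 - the anagrams no longer collide
print(distance("lemons", "lemons"))   # 0.0
```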

So with this improved vector scheme we have, in purely algorithmic terms, a really simple way of breaking down any series of alphabetic characters and comparing its "distance" from any other series of alphabetic characters. This notion of "distance" is actually very close to our human notion of "similarity" - at least in terms of the alphabetic building blocks of two words - and that's quite powerful.

What's more, by simply extending the input conditions we can include numerals, "foreign" characters, shapes, even pictograms - and still be able to identify closeness. It would be interesting to put something like this together and see if, using clustering and a bit of human guidance (though it would be interesting to try without), such a system might help classify English words in terms of their origin. English words taken from French, for example, might share a particular set of features, perhaps including ("-re", "-aux", "-ois"), while Germanic or Old English words might contain ("-ang", "-tz", "-gh") or similar - I must confess, I'm no linguist. But the concept ought to yield, if nothing else, a series of word clusters whose members, I'd wager, might share a common provenance.

Another fun thing to consider is that we might be able to establish theoretical words. Imagine a feature space consisting of all 1-tuple, 2-tuple, 3, 4, 5 and 6-tuple dimensions. If we then populated that space with all the words in the dictionary, from a text, or trawled from the internet, we'd be able to find the most "average" word by picking the one that sits most centrally. Or, more spookily, we could calculate the most central location in the feature space and reconstruct the word that might appear there - despite it never having existed in any dictionary, been spelt out in a forum, or committed to paper, print or screen. Performing this reconstruction - essentially running the transformation backwards from a location in the feature space - is tricky, but if the space were constructed formally enough, it would be possible.

The downside of this approach, especially if you want to build in that reversibility, is that you need a massively wide set of dimensions. There are some clever ways of keeping these to a minimum that I might go into later - and there are also some problems inherent in this approach when you're trying to calculate distances between more than, say, 100 vectors at a time - but we can talk about those later.

To summarise then: this is a technique for converting a one-dimensional string of objects, each of which belongs to a finite set (i.e. an alphabet), into an n-dimensional feature-space, where n is the size of the alphabet, plus its square for 2-tuples, plus its cube for 3-tuples and so on, for however many x-tuples are included to take care of the anagram problem. The purpose is to "vectorise" such one-dimensional strings so that "similarity" functions can be performed on them.
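For concreteness, with the 26-letter English alphabet and tuples up to length 3, that works out as

$$26 + 26^2 + 26^3 = 26 + 676 + 17,576 = 18,278$$

dimensions in total.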

It's worth noting here that these techniques are not new, nor particularly original - indeed they're well known and widely publicised. I think they're interesting though, and hope the reader finds something of interest in my discussing them.

I'm working on some code examples at the moment which I'll share along with results once I'm happy.

Wednesday, 9 December 2015

Data and Change

It's an interesting time, right now - I'm not entirely sure I've got a handle on it.

Since Paris, I've done a little freelance work, but mostly been looking for the next long-term job, and in doing so, making something of a minor course-correction.

You see, for a while now, I've been working very much in a niche function of a niche (if global) industry. Even more nichely, the focus has been around a particular software vendor's product and its support. It's served me well, and given me a depth and breadth of experience I might otherwise have struggled to find. I've also benefited from a fairly close-knit network of colleagues, each of whom has a similar set of experiences built up over time. But as time goes by, I'm finding myself interested in areas that don't neatly overlap with recs.

Interestingly, that's quite a statement - there's not much that doesn't somehow share some degree of contextual intersection with software modelled reconciliations - but that's another story.

So in this most recent job-search, I've been trying to take a step away from recs, broaden my horizons, and do something different - ideally, something linked to data science, analytics or similar. To assist, countless recruitment consultants have called, sent over specs and discussed skills, salaries and start-dates. Sometimes I've enjoyed the process, and at other times I've found it deeply frustrating. I don't envy the recruitment consultant's role in life. Some of them are good, decent and honest people. Others, and I'm sure these were just having a bad day, didn't fill me with confidence. Which is fine under normal circumstances - but when you're responsible for a family, you don't want to place your future in the hands of someone who lies at the drop of a hat - least of all one who lies so unconvincingly.

As I write, there is an offer in the works, for a job whose title hasn't yet been decided, the working arrangements of which are not yet known, but for which I'm very excited. It will mean working with data, and helping others work effectively with data too - and, to an extent not yet entirely known, it may also mean working on new technology and methodologies within an environment of like-minded people.

I'm not sure if it's fair, but this could be my Solsbury Hill moment, which for me suggests a kind of homecoming/becoming thing. I'm sure I'm over-romanticising here.

Upcoming things to think about -

  • What is 'data'?
  • What common problems exist when handling/managing data?
  • How can we break those problems down into their most abstract form?
  • What tools can we build around those abstract ideas that will generate practical value?
  • How can we leverage the usage differences between mutable and immutable data records?
  • What standards are out there for taxonomies, classifications, models and forms?
  • What are the practical pros and cons of those standards?
  • How can different data be 'vectorised' i.e. turned into a vector?

I'm sure there will be more, but these are what's interesting at the moment.

Thursday, 23 July 2015

Auf Wiedersehen Paris

After an interesting few months, it looks as though my time in Paris is coming to an end.

It's been an interesting assignment, and I've been able to work on developing lots of interesting ideas that hopefully I can take forward into my next role - wherever that will be.

I don't look forward to restarting the process of finding paying work, but it's important to approach these things flexibly, and positively - otherwise, things can quickly spiral into negativity - which is no fun, and is rarely warranted.

I'd like to focus more on data analytics in the future, drawing information, narrative and decision from raw data - and helping others get to grips with the kinds of techniques and applications that can make this happen.

I'd also like to continue working with python - I'm still impressed with the speed with which thorny problems can be tackled - and would like to apply some of these techniques in a more commercial setting. So far, much of my work has been to provide tools, the users of which don't necessarily see or know the internal workings - only the outputs.

So, that's what I'd like - but at some point I may have to take something slightly off-track; got to pay those bills.

Yesterday, I started applying to nearly every job pinged over to my email by the JobServe search engine, and so far I've not received a sausage in terms of a human reply. Either there's some glaring gap in my CV, or I should relax a little (for now) and wait to see what comes in later - although, characteristically, I notice that I'm faced with a data-analysis problem here.

I don't want to apply for any job more than once - but the job boards are populated with entries from different agencies and, what's more, there are multiple job boards. This means that any particular position is likely to be posted to more than one board, and by more than one agent, which potentially results in duplication. So the question is: how do I systematically keep track of which jobs I've applied for, in what areas, and via which agencies?

I think I may have found my next working project...
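A naive first sketch of what that might look like in Python - the adverts, agencies and boards below are made up, and real listings would need fuzzier matching than exact keys, but it's a starting point:

```python
# Key each advert on a normalised (title, location) pair and flag anything
# already logged, so the same position posted twice isn't applied for twice.
applied = {}

def normalise(text):
    return " ".join(text.lower().split())

def log_application(title, location, agency, board):
    key = (normalise(title), normalise(location))
    if key in applied:
        print(f"Probable duplicate - already applied via {applied[key]}")
    else:
        applied[key] = f"{agency} ({board})"
        print(f"Logged: {title}, {location}")

log_application("Data Analyst", "London", "Acme Recruitment", "JobServe")
log_application("Data  analyst", "london", "Widget Search", "another board")   # flagged
```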

If last time was anything to go by, finding the next thing could take something like 2 months - but fingers crossed it's quicker than that.

Oh and Paris? Yes, it's been nice knowing you.