Tuesday, 11 July 2017

Doing Data Science - Curve Fitting

What I've been working on is consolidating my efforts into staying employed, building up a skills-base in case things should take an unfortunate turn for the worse.

This last year or so, I've been given a huge opportunity - stitching together elements from a number of relatively new technologies and trying to form them into a unified, cohesive, useful offering.

I've not talked about linked-data here before, but it's worth looking into. The idea behind it is to create a web of data points, the same way that we're now familiar with a web of documents. Instead of using HTML as the language for building this web, we use something called RDF. Interestingly, RDF began its life way back in 1997, so it's not itself particularly new.

On its own, RDF is a little underwhelming (though it does have a peculiarly elegant structure - some may disagree) but what is cool about it is that it provides a means of easily integrating datasets from all over the place - enabling one to conduct the kind of searches across those integrated datasets that would normally be tricky using regular SQL. Things like dependency chains and friend-of-friend type searches, for example. Add in some graph theory (i.e. network theory, not some meta-notions about charts and diagrams), and you're suddenly able to determine cliques, most connected nodes, functional behaviour of different objects, anomaly detection and all manner of other cool and immensely useful stuff.
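To make that concrete, here's a minimal, stdlib-only sketch (names and relationships invented for illustration) of two of those graph queries: finding the most connected node, and a friend-of-friend search:

```python
# A toy "web of data": who knows whom (all names invented)
edges = [("ann", "bob"), ("bob", "cat"), ("ann", "cat"),
         ("cat", "dan"), ("dan", "eve")]

# Build an adjacency map from the edge list
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

# Most connected node: the one with the highest degree
most_connected = max(graph, key=lambda n: len(graph[n]))

# Friend-of-friend search: everyone exactly two hops from "ann"
fof = set().union(*(graph[f] for f in graph["ann"])) - graph["ann"] - {"ann"}
```

In a real deployment this kind of query would run over an RDF triple store via SPARQL, or a dedicated graph library, but the underlying idea is just set operations over an adjacency structure.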

Fold in a bit of machine learning, textual analysis and some powerful statistical crowbars, and you've got a really decent analytical platform, capable of generating a great deal of insight and business information for a relatively low investment.

So that's 3 separate areas of development I've been working in - and it's been a proper journey. I've coded into the late evenings, consulted O'Reilly, StackOverflow, and a number of extremely gifted colleagues who've revealed to me quite how little I knew. I think that's been simultaneously one of the most inspiring and humbling aspects of all this, having had an opportunity to mix with mathematicians, scientists and engineers and in the process learn and create something new - in the process (hopefully) raising my own game - a year in, I feel a little less of a fraud.

Python has made this practical appreciation of the abstract possible for me - in a way I've not found elsewhere in 30 years - finally, knowing the theory actually counts for something. If, like me, you're shaky on the theory, you're no longer tied down to performing hours of tedious calculations by hand in order to try something out.

Let's do an example - to give an idea of what we're talking about:

I wanted to construct a population model to explore the idea that birth-rates are linked to population density. The hypothesis being that people are fussy and are more likely to find a partner given a large enough pool to choose from. I was imagining a mechanic akin to each person in a population rolling an n-sided die; where two people roll the same number, hurray, they pair off and start making babies.

This is not dissimilar to the idea of the "Birthday Conjecture", in which it is shown that there is a surprisingly high probability that, in a (relatively small) room full of people, someone will share a birthday with someone else.

The difference being, rather than looking for the population size that will result in a single "coincidence", I wanted to get an idea of the rate of matching-off and its relationship between both population size and the n that defines the number of sides the n-sided dice has (for the birthday problem, n would be 365).

Enter the python!

import numpy as np
from numpy import random
import matplotlib.pyplot as plt
import math
from scipy import optimize
%matplotlib inline

symbols = 365  # variance or diversity of population (sides on the die)
units = 1600   # maximum number of people per run
trials = 2000  # number of repeated runs; defines smoothness

# accumulators (reconstructed from the description below)
multiples = np.zeros(units)  # fraction of people sharing their roll
alom = np.zeros(units)       # count of runs with at least one multiple
for j in range(trials):
    for i in range(2, units):
        rolls = random.randint(0, symbols, i)  # everyone rolls the die
        counts = np.bincount(rolls, minlength=symbols)
        # proportion of people whose roll occurs more than once
        multiples[i] += counts[counts > 1].sum() / i
        # at least one multiple
        alom[i] += 1 if (counts > 1).any() else 0
x = np.arange(units)
multiples_collection = multiples / trials
alom_collection = alom / trials

What's going on here: a number of random trials are set up (in this case 2000) in which rooms are repeatedly populated with people, from 2 up to a maximum of 1600 (the units parameter). Each unit rolls their n-sided die (with sides defined by the variable symbols), and the number of "matches" is counted for each contrivance. These results are averaged across the trials, and the averaged values stored in the multiples_collection array.

Not explored further here, but it's worth noting that alom_collection holds the averaged scores showing how many of the trials returned at least one multiple, which is the specific case of the Birthday Conjecture mentioned earlier.

def exponential_cdf(x, lda):
    return 1 - np.exp(-lda * x)

def exponential_pdf(x, lda):
    return lda * np.exp(-lda * x)

Here, we define the probability distribution functions for an exponential distribution. Both the pdf and cdf are provided, though we only attempt to fit the cdf here.

lda = optimize.curve_fit(exponential_cdf,x,multiples_collection)[0][0]

This is a little Python magic provided by scipy: it returns the best-fit value of the lambda parameter for the exponential_cdf function defined earlier, when compared against the data. The joy here is that this best-fit parameter search is as simple as defining the function in a way that exposes its parameters (as shown earlier) and running curve_fit against it.

fig, ax = plt.subplots()
ax.plot(x, multiples_collection, x, exponential_cdf(x, lda))
ax.set_ylabel("Proportion of Multiples")
ax.set_xlabel("Number of Rolls")

Plotting the data against the exponential_cdf with the discovered parameter returns a pair of lines so close that they're almost indistinguishable from one another: a good fit indeed! This means our data can be explained by the exponential_cdf, which provides a shortcut for later modelling, but also implies qualitative information about the underlying process.

Taking the reciprocal of the lda parameter gives 365: the number of sides of the imagined die that controls the matching "game" we simulated! So, having created a simulation using a great deal of brute-force stochastic trials, we've been able to reduce the results to a far simpler mathematical explanation with only a single parameter.

At this point, you might be saying "So what?" and fair enough, it is an artificial example - but, imagine we've been collecting data from some unknown process and find that it follows some specific kind of distribution. Using this rather simple set of steps (and an armoury of appropriate distribution functions) we can approximate the distribution mathematically, arriving not only at a distribution "type" but also at (estimates of) its fundamental parameters. This is extremely powerful. If I learn that data describing the price of a commodity is normally distributed, that tells me something very different from learning the same thing is exponentially distributed (reaching for an example here, but this should give you an idea). Discovering the parameters of a distribution allows me to run reasonably well-tuned simulations, and so make predictions about how those processes will perform in the future.

This only scratches the surface, but I hope I've made the case that this is indeed interesting, and useful.

Tuesday, 10 January 2017


I bought one of these again the other day - actually, it was way back in June - just not been keeping up with blogging very well! The one I already had was so good that I decided to get another one.

The application is marginally different this time around however, as I've got it accepting a digital out signal from my desktop computer.

This means that the unit acts as a DAC, accepting a digital stereo signal and piping it out to an amplifier, optionally recording the signal onto MiniDisc as it happens.

It's a sturdy little unit, and convenient to be able to record snippets of audio as they happen. What's more, with some thought to connections, it acts as functional junction-box between multiple components in a hifi-system, bridging digital and analogue technologies cleanly and without effort.

Interestingly, I've also started incorporating a small mixer unit into my setup so I can control relative volume levels between different separates, as well as gaining additional flexibility on what I can pipe through the system - this might be microphone inputs, or mixing signals from multiple units at once.

I had been considering finding a cheap DAT unit to try alongside, but am yet to be fully convinced - it would also be nice to try and find a matching amplifier and (eventually) connect a DAB radio unit I'd picked up around the same time.

The last six months or so have been exceedingly busy, so lots of things to catch-up on.

Friday, 29 January 2016

Some thoughts on Vectorisation

Last post, I mentioned something about vectorising data.

What does this mean, and why do it?

A vector is a measurement with both a magnitude and a direction. Sometimes the directions are described in terms of a bearing, but it's also conventional to represent a vector as a tuple of values across a series of dimensions, each of which is (normally) considered as being at right-angles to one another in a coordinate system.

The nice thing about vectors in an n-dimensional coordinate system is that they can be manipulated and worked with using linear algebra, enabling the computerisation of functions such as finding nearest matches, categorisation, clustering, network analysis, search algorithms and other useful techniques.

Let's take an example - say I've got the latitude and longitude coordinates of a group of people who are out with their mobile phones. I could calculate their relative distances and use this to pair them off with their nearest counterparts (subject to there being an even number of them, and ignoring tricky 3-way type situations for now). That's basic employment of Pythagoras' theorem.


And then finding the minimum value for each pair.
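A minimal sketch of that pairing step (coordinates and group size invented, and treating latitude/longitude as plane coordinates, which is fine at small scale):

```python
import numpy as np

# Hypothetical positions: (latitude, longitude) for six people
points = np.array([
    [51.50, -0.12], [51.51, -0.10], [48.85, 2.35],
    [48.86, 2.34], [40.71, -74.00], [40.72, -74.01],
])

# Pairwise Euclidean distances via Pythagoras
diffs = points[:, None, :] - points[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=-1))
np.fill_diagonal(dists, np.inf)  # ignore self-distances

# Greedily pair each person off with their nearest unpaired counterpart
unpaired = set(range(len(points)))
pairs = []
while len(unpaired) > 1:
    i = min(unpaired)
    j = min((k for k in unpaired if k != i), key=lambda k: dists[i, k])
    pairs.append((i, j))
    unpaired -= {i, j}
```

The broadcasting trick builds the full distance matrix in one go, which is the "minimum value for each pair" step in matrix form.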

Using slightly more advanced mathematics, we could group these pairs into clusters, or find the line that best splits them into two, three, or more groups - it's all just an extension of Pythagoras (ok, sorry Linear Algebra folks, I know, that is an over-simplification).

Now consider someone typing "Piefegoras" into a Google-search. How might Google get from that munged input to the results for Pythagoras?

One way might be to "vectorise" the English language. Take each of the 1,000,000 or so English words, and break them down into a 26-dimensional vector space where each dimension counts the occurrences of one letter. So the word "aardvark" for example might be represented as the vector (3,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,1,0,0,0,0) - and the word "vector" would be represented as the vector (0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,1,0,0,0,0) . Now in both these examples, there are 26 dimensions, so we might have to tweak Pythagoras a little bit - but it turns out that the distance between the two can be determined using: $$h^2 = a^2+b^2+c^2+d^2+e^2+f^2+ ... +w^2+x^2+y^2+z^2$$

Where each of (a, b, c,...,x, y, z) is the difference between that letter's vector components. So for example, the a value in the above formula would be 3 since there are 3 a's in "aardvark" and none in "vector". So already, we can map out all the words in the dictionary, and indeed all the words not in the dictionary (I looked up "Piefegoras" and it's not there...yet) that use the 26-characters we have in the English alphabet.
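The letter-count vectorisation and its 26-dimensional distance are only a few lines of Python (this is a sketch of the idea described above, not a production spell-checker):

```python
import numpy as np
from string import ascii_lowercase

def letter_vector(word):
    """Map a word to a 26-dimensional letter-count vector."""
    word = word.lower()
    return np.array([word.count(c) for c in ascii_lowercase])

def distance(w1, w2):
    """Euclidean distance in the 26-dimensional letter space."""
    return np.sqrt(((letter_vector(w1) - letter_vector(w2)) ** 2).sum())

# "aardvark" vs "vector": differences a=3, c=1, d=1, e=1, k=1, o=1, r=1, t=1
# so h² = 9+1+1+1+1+1+1+1 = 16, giving a distance of 4.0
print(distance("aardvark", "vector"))
```

To find a "Did you mean...?" suggestion for "Piefegoras", you'd compute this distance against every dictionary word and take the minimum.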

Once we can build a map of this vector space, we can do all the same things we did with the people and their 2d latitude and longitude values - cluster words that consist of similar letter counts - indeed, we can use this to classify all words that share the same numbers of letters, i.e. anagrams, or that are 1 letter different, or 2 letters different, and so on - basically, half the puzzles page at the back of the newspaper becomes trivial from now on.

But not only this, we now have a viable spell-checker, or as in the example, a "Did you mean.."-er that will take garbled input and be able to find a nearest neighbour and offer that as a suggestion.

This is a simplification: our initial set of 26 dimensions is ok, but doesn't make the distinction between, say, "hatred" and "thread", or "melons" and "lemons". So we need to include an additional set of vectors that are sensitive to letter placement and order. One such set comprises the vectors corresponding to each letter pair ("aa", "ab", "ac", ..., "zx", "zy", "zz"): an impressively large set, adding a further 676 dimensions for a vector space totalling 702 dimensions. This makes the maths a little more unwieldy, and the storage requirements higher, but hey, it's 2016 and we've got computers. And why stop there? It may be advantageous to include 3-tuples of letters ("aaa", "aab", "aac", ..., "zzx", "zzy", "zzz"), or 4-tuples, or 5-tuples and so on - each additional n-tuple exponentially increasing the number of dimensions under review - so it makes sense to find a sensible balance-point where the additional expense doesn't provide enough utility to warrant the extra complexity. My guess is this is around 3-tuples, but 2 is probably enough for most jobs.
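Extending the earlier sketch to the 702-dimensional (letters plus letter-pairs) scheme might look like this - again just an illustration of the idea, and deliberately naive about performance:

```python
import numpy as np
from string import ascii_lowercase

letters = list(ascii_lowercase)
bigrams = [a + b for a in letters for b in letters]  # 676 letter pairs
dims = letters + bigrams                             # 702 dimensions in total

def vectorise(word):
    """1-gram counts plus 2-gram counts: sensitive to letter order."""
    word = word.lower()
    grams = list(word) + [word[i:i + 2] for i in range(len(word) - 1)]
    return np.array([grams.count(d) for d in dims])

def distance(w1, w2):
    return np.sqrt(((vectorise(w1) - vectorise(w2)) ** 2).sum())
```

With bigrams included, anagrams like "melons" and "lemons" are no longer zero distance apart, because their letter-pair components differ even though their letter counts match.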

So with this improved vector scheme we have, in purely algorithmic terms, a really simple way of breaking down any series of alphabetic characters and comparing its "distance" from any other series of alphabetic characters. The concept of "distance" here is actually very close to our human concept of "similarity" at least in terms of the alphabetic building blocks of two words and that's actually quite powerful.

What's more, by simply extending the input conditions we can include numerics, "foreign" characters, shapes, even pictograms - and still be able to identify closeness. It would be interesting to put something like this together, and see if, using clustering and a bit of human guidance (though it'd be interesting to try otherwise) such a system might be able to help classify English words in terms of their origin. English words from French for example, might share a particular set of features like the inclusion of perhaps ("-re", "-aux", "-ois") perhaps, while Germanic or old English words might contain ( "-ang", "-tz", "-gh" ) or similar - I must confess, I'm no linguist. But, the concept ought to yield if nothing else, a series of cluster-words, the provenance of which, I'd wager, might be shared.

Another fun thing to consider is that we might be able to establish theoretical words - imagine a feature space consisting of all 1-tuple, 2-tuple, 3, 4, 5 and 6-tuple dimensions. If we then populated that space with all the words in the dictionary, from a text, or trawled from the internet, we'd be able to find the most "average" word by picking the one that was most central. Or, more spookily, calculating the most central location in the feature-space, and reconstructing the word that might appear there - despite it never having existed in any dictionary, been spelt out in a forum or committed to paper, print or screen. Performing this reconstructive process - essentially performing the transformation backwards from a location in the feature space - is tricky, but if the space were constructed formally enough, this would be possible.
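The easier half of that idea - finding the most "average" existing word - is just a centroid and a nearest-neighbour search. A sketch over a tiny invented word list, using the simple letter-count vectors for brevity:

```python
import numpy as np
from string import ascii_lowercase

words = ["cat", "cart", "care", "core", "bore", "tare"]  # toy "dictionary"

def vec(word):
    return np.array([word.count(c) for c in ascii_lowercase])

space = np.array([vec(w) for w in words])
centroid = space.mean(axis=0)  # the most central location in the space

# The most "average" word is the one nearest the centroid
dists = np.sqrt(((space - centroid) ** 2).sum(axis=1))
most_average = words[int(dists.argmin())]
```

Going the other way - reconstructing a never-written word from the centroid itself - is the hard, inverse problem alluded to above, and this sketch doesn't attempt it.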

The downsides of this approach, especially if you want to build in the reversibility feature is that you need a massively wide set of dimensions. There are some clever ways to keep these down to a minimum that I might go into later - and there are also some problems inherent in this approach when you're trying to calculate distances between more than say 100 vectors at a time - but we can talk about these later.

To summarise then: this is a technique for converting a one-dimensional string of objects, each of which belongs to a finite set (i.e. an alphabet), into an n-dimensional feature-space, where n is the size of the alphabet, plus n² for 2-tuples, plus n³ for 3-tuples, and so on, for however many x-tuples are included to take care of the anagram problem. The purpose of all this is to "vectorise" such one-dimensional strings in order to perform "similarity" functions upon them.

It's worth noting here that these techniques are not new, nor particularly original - indeed they're well known and widely publicised. I think they're interesting though, and hope the reader finds something of interest in my discussing them.

I'm working on some code examples at the moment which I'll share along with results once I'm happy.

Wednesday, 9 December 2015

Data and Change

It's an interesting time, right now - I'm not entirely sure I've got a handle on it.

Since Paris, I've done a little freelance work, but mostly been looking for the next long-term job, and in doing so, making something of a minor course-correction.

You see, for a while now, I've been working very much in a niche function of a niche (if global) industry. Even more nichely, the focus has been around a particular software vendor's product and its support. It's served me well, and given me a depth and breadth of experience I might otherwise have found difficult to find elsewhere. I've also benefited from a fairly close-knit network of colleagues, each of whom have a similar set of experiences built up over time. But as time goes by, I'm finding myself interested in areas that don't neatly overlap with recs.

Interestingly, that's quite a statement - there's not much that doesn't somehow share some degree of contextual intersection with software modelled reconciliations - but that's another story.

So in this most recent job-search, I've been trying to take a step away from recs, broaden my horizons, and do something different - ideally, something linked to data-science, analytics or similar. To assist, countless recruitment consultants have called, sent over specs and discussed skills, salaries and start-dates. Sometimes, I've enjoyed the process, and at others, I've found it deeply frustrating. I don't envy the recruitment consultant's role in life. Some of them are good, decent and honest people. Others, and I'm sure these were just having a bad day, didn't fill me full of confidence. Which is fine under normal circumstances - but when you're responsible for a family, you don't want to place your future in the hands of someone who lies at the drop of a hat - least of all, one who lies so unconvincingly.

As I write, there is an offer in the works, for a job whose title hasn't yet been decided, the working arrangements of which are not yet known, but for which I'm very excited. It will mean working with data, and helping others work effectively with data too - and, to an extent not yet entirely known, it may also mean working on new technology and methodologies within an environment of like-minded people.

I'm not sure if it's fair, but this could be my Solsbury Hill moment, which for me suggests a kind of homecoming/becoming thing. I'm sure I'm over-romanticising here.

Upcoming things to think about -

  • What is 'data'
  • What common problems exist when handling/managing data
  • How can we break those problems down into their most abstract form?
  • What tools can we build around those abstract ideas that will generate practical value?
  • How can we leverage the usage differences between mutable and immutable data records?
  • What standards are out there for taxonomies, classifications, models and forms?
  • What are the practical pros and cons of those standards?
  • How can different data be 'vectorised' i.e. turned into a vector?

I'm sure there will be more, but these are what's interesting at the moment.

Thursday, 23 July 2015

Auf Wiedersehen Paris

After an interesting few months, it looks as though my time in Paris is coming to an end.

It's been an interesting assignment, and I've been able to work on developing lots of interesting ideas that hopefully I can take forward into my next role - wherever that will be.

I don't look forward to restarting the process of finding paying work, but it's important to approach these things flexibly, and positively - otherwise, things can quickly spiral into negativity - which is no fun, and is rarely warranted.

I'd like to focus more on data analytics in the future, drawing information, narrative and decision from raw data - and helping others get to grips with the kinds of techniques and applications that can make this happen.

I'd also like to continue working with python - I'm still impressed with the speed with which thorny problems can be tackled - and would like to apply some of these techniques in a more commercial setting. So far, much of my work has been to provide tools, the users of which don't necessarily see or know the internal workings - only the outputs.

So, that's what I'd like - but at some point may have to take something slightly off-track, got to pay those bills.

Yesterday, I started applying to nearly every job pinged over to my email by the JobServe search engine, and so far, not received a sausage in terms of a human reply. Either there's some glaring gap in my CV, or I should relax a little (for now) and wait to see what comes in later - although, characteristically, I notice that I'm faced with a data-analysis problem here.

I don't want to apply for any job more than once - but the job-boards are populated with entries from different agencies - what's more, there are multiple job-boards. This means that any particular position is likely to be posted to more than one board, and by more than one agent, potentially resulting in duplication. So the question is: how to systematically keep track of which jobs have been applied for, in what areas, and via which agencies?
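A first pass at that de-duplication might simply normalise each posting into a key, so the same role posted by different agencies collapses to one entry (the fields chosen here are invented for illustration; real postings would need fuzzier matching):

```python
def posting_key(title, location, salary):
    """Normalise a posting so trivial formatting differences collapse."""
    norm = lambda s: "".join(ch for ch in s.lower() if ch.isalnum())
    return (norm(title), norm(location), salary)

applied = set()

def should_apply(title, location, salary):
    """True only the first time this (normalised) posting is seen."""
    key = posting_key(title, location, salary)
    if key in applied:
        return False
    applied.add(key)
    return True
```

Identical roles with different agency wording in the title would still slip through, which is where the similarity techniques from the vectorisation ideas above could eventually come in.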

I think I may have found my next working project...

If last time was anything to go by, finding the next thing could take something like 2 months - but fingers crossed it's quicker than that.

Oh and Paris? Yes, it's been nice knowing you.