What I've been working on is consolidating my efforts into staying employed - building up a skills base in case things should take an unfortunate turn for the worse.

This last year or so, I've been given a huge opportunity - stitching together elements from a number of relatively new technologies and trying to form them into a unified, cohesive, useful offering.

I've not talked about linked data here before, but it's worth looking into. The idea behind it is to create a web of data points, in the same way that we're now familiar with a web of documents. Instead of using HTML as the language for building this web, we use something called RDF. Interestingly, RDF began its life way back in 1997, so it's not itself particularly new.

On its own, RDF is a little underwhelming (though it does have a peculiarly elegant structure - some may disagree), but what is cool about it is that it provides a means of easily integrating datasets from all over the place - enabling one to conduct the kind of searches across those integrated datasets that would normally be tricky using regular SQL. Things like dependency chains and friend-of-friend searches, for example. Add in some graph theory (i.e. network theory, not some meta-notions about charts and diagrams), and you're suddenly able to determine cliques, most-connected nodes, the functional behaviour of different objects, anomalies and all manner of other cool and immensely useful stuff.
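To make that concrete, here's a minimal sketch of a friend-of-friend query, run over a handful of subject-predicate-object triples (the shape RDF statements take). No RDF library here, and the names and data are invented purely for illustration:

```python
# A toy triple store: (subject, predicate, object) tuples,
# mimicking the shape of RDF statements.
triples = {
    ("alice", "knows", "bob"),
    ("bob",   "knows", "carol"),
    ("carol", "knows", "dave"),
    ("alice", "worksFor", "acme"),
}

def friends_of_friends(store, person):
    """People reachable via exactly two 'knows' hops."""
    direct = {o for (s, p, o) in store if s == person and p == "knows"}
    return {o for (s, p, o) in store
            if s in direct and p == "knows"} - direct - {person}

print(friends_of_friends(triples, "alice"))  # {'carol'}
```

In SQL this two-hop traversal would need a self-join (and every extra hop another join); expressed over triples it stays a uniform pattern-match, which is the point of the data model.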

Fold in a bit of machine learning, textual analysis and some powerful statistical crowbars, and you've got a really decent analytical platform, capable of generating a great deal of insight and business information for a relatively low investment.

So that's three separate areas of development I've been working in - and it's been a proper journey. I've coded into the late evenings and consulted O'Reilly, StackOverflow, and a number of extremely gifted colleagues who've revealed to me quite how little I knew. That's been simultaneously one of the most inspiring and humbling aspects of all this: having had an opportunity to mix with mathematicians, scientists and engineers, and in the process learn and create something new - hopefully raising my own game along the way. A year in, I feel a little less of a fraud.

Python has made this practical appreciation of the abstract possible for me - in a way I've not found elsewhere in 30 years - finally, knowing the theory actually counts for something. If, like me, you're shaky on the theory, you're no longer tied down to performing hours of tedious calculations by hand in order to try something out.

Let's do an example - to give an idea of what we're talking about:

I wanted to construct a population model to explore the idea that birth rates are linked to population density. The hypothesis is that people are fussy, and are more likely to find a partner given a large enough pool to choose from. I imagined a mechanic akin to each person in a population rolling an n-sided die: when two people roll the same number, hurray - they pair off and start making babies.

This is not dissimilar to the "Birthday Problem", in which it is shown that there is a surprisingly high probability that, in a (relatively small) room full of people, someone will share a birthday with someone else.

The difference is that, rather than looking for the population size that will result in a single "coincidence", I wanted to get an idea of the rate of pairing-off and its relationship to both the population size and the n that defines the number of sides the die has (for the birthday problem, n would be 365).
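For reference, the classic birthday probability itself takes only a few lines to check; the well-known result is that 23 people is enough for a better-than-even chance of a shared birthday:

```python
import math

def p_shared(n_sides, people):
    """Probability that at least two of `people` rolls of an
    n_sides-sided die collide (the birthday problem for n_sides=365)."""
    # P(all distinct) = (n/n) * ((n-1)/n) * ... * ((n-people+1)/n)
    p_all_distinct = math.prod((n_sides - k) / n_sides for k in range(people))
    return 1 - p_all_distinct

print(round(p_shared(365, 23), 3))  # ~0.507: just past even odds at 23 people
```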

Enter the python!

```python
import numpy as np
from numpy import random as random
import matplotlib.pyplot as plt
from scipy.interpolate import UnivariateSpline
import math
from scipy import optimize
%matplotlib inline

symbols = 365   # variance or diversity of population
units = 1600    # number of objects per run
trials = 2000   # defines smoothness

multiples_collection = []
alom_collection = []
for j in range(0, trials):
    multiples = []
    alom = []
    for i in range(2, units):
        rolls = random.randint(1, symbols + 1, i)
        counts = np.bincount(rolls, minlength=symbols)
        # proportion of multiples
        multiples.append(np.sum(counts[np.where(counts > 1)]) / i)
        # at least one multiple
        alom.append(1 if len(counts[np.where(counts > 1)]) else 0)
    multiples_collection.append(multiples)
    alom_collection.append(alom)

multiples_collection = np.array(multiples_collection).mean(axis=0)
alom_collection = np.array(alom_collection).mean(axis=0)
```

What's going on here is that a number of random *trials* are set up (in this case 2000), wherein rooms are repeatedly populated with people, from 2 up to a maximum of 1600 (the *units* parameter). Each unit rolls their n-sided die (sides defined by the variable *symbols*), and the number of "matches" is counted for each room. The results are averaged across trials, and the averaged values stored in the *multiples_collection* array.

Not explored further here, but it's worth noting that *alom_collection* holds the averaged scores showing how many of the trials returned *at least one multiple* - which is the specific case of the Birthday Problem mentioned earlier.
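For readers without NumPy to hand, the same mechanic can be sketched in a few lines of standard-library Python (much slower, so the trial count here is scaled well down from the run above):

```python
import random
from collections import Counter

def trial_proportion(symbols, people, rng=random):
    """Proportion of people whose roll is shared with someone else.
    Note: random.randint is inclusive at both ends, unlike numpy's."""
    rolls = [rng.randint(1, symbols) for _ in range(people)]
    counts = Counter(rolls)
    return sum(c for c in counts.values() if c > 1) / people

random.seed(42)  # fixed seed so the estimate is reproducible
trials = 200
avg = sum(trial_proportion(365, 23) for _ in range(trials)) / trials
# avg is a Monte Carlo estimate of the expected matched proportion
# for a room of 23 people - a small number, well below the ~0.5
# "at least one match" probability, which is a different quantity
```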

```python
def exponential_cdf(x, lda):
    return 1 - pow(math.e, -(lda * x))

def exponential_pdf(x, lda):
    return lda * pow(math.e, -(lda * x))
```

Here, we define the probability functions for an exponential distribution; both the pdf and cdf are provided, though we only attempt to fit the cdf here.

```python
x = np.arange(2, units)  # the room sizes the simulation stepped through
lda = optimize.curve_fit(exponential_cdf, x, multiples_collection)[0][0]
```

This is a little Python magic provided by scipy - it returns the best-fit value of the *lambda* parameter for the *exponential_cdf* function defined earlier, when compared against the data. The joy here is that this best-fit parameter search is as simple as defining the function so that it exposes its parameters (as shown earlier) and running *curve_fit* against it.
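Under the hood, *curve_fit* performs nonlinear least squares. To show what that means without any scipy, here's a deliberately crude stand-in - a brute-force search over candidate lambdas, minimising the squared error against synthetic, noise-free data generated from a known lambda (all the values here are chosen purely for illustration):

```python
import math

def exp_cdf(x, lda):
    """Scalar exponential cdf (named exp_cdf to avoid shadowing
    the array-friendly exponential_cdf defined in the post)."""
    return 1 - math.exp(-lda * x)

# Synthetic data from a known lambda, so we can check the recovery
true_lda = 1 / 365
xs = list(range(2, 1600, 50))
data = [exp_cdf(x, true_lda) for x in xs]

def fit_lambda(xs, data, candidates):
    """Brute-force least squares: pick the candidate lambda with
    the smallest summed squared error against the data."""
    def sse(lda):
        return sum((exp_cdf(x, lda) - y) ** 2 for x, y in zip(xs, data))
    return min(candidates, key=sse)

candidates = [k / 1e6 for k in range(1, 10001)]  # grid from 1e-6 to 1e-2
best = fit_lambda(xs, data, candidates)
print(round(1 / best))  # recovers ~365
```

*curve_fit* does the same job far more cleverly (Levenberg-Marquardt rather than a grid), but the objective - minimise the squared error between model and data - is the same.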

```python
fig = plt.figure(figsize=(8, 8))
plt.plot(multiples_collection)
ax = plt.gca()
ax.set_ylabel("Proportion of Multiples")
ax.set_xlabel("Number of Rolls")
x = np.arange(2, units)
# exponential_cdf accepts the whole array at once, thanks to
# NumPy broadcasting - no need to vectorise it by hand
plt.plot(x, exponential_cdf(x, lda), linewidth=1)
```

Plotting the data against *exponential_cdf* with the discovered parameter returns a pair of lines so close that they're almost indistinguishable from one another - a good fit indeed! This means our data can be explained using the *exponential_cdf*, which provides a shortcut for later modelling, but also implies qualitative information about the underlying process.

Taking the reciprocal of the *lda* parameter gives **365** - the number of sides of the imagined die that controls the matching "game" we simulated! So, having created a simulation using a great deal of brute-force stochastic trials, we've been able to reduce the results to a far simpler mathematical explanation with only a single parameter.
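That reciprocal isn't a coincidence. The chance that a given person matches nobody among the other k-1 people is (1 - 1/n)^(k-1), so the expected matched proportion is 1 - (1 - 1/n)^(k-1), which is approximately 1 - e^(-(k-1)/n) - an exponential cdf with lambda of roughly 1/n. A quick numeric sanity check (values chosen for illustration):

```python
import math

n, k = 365, 800  # die sides and room size

exact = 1 - (1 - 1 / n) ** (k - 1)   # expected matched proportion
approx = 1 - math.exp(-(k - 1) / n)  # exponential cdf with lambda = 1/n

# the two agree to roughly three decimal places at this room size
print(round(exact, 3), round(approx, 3))
```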

At this point, you might be saying "So what?" - and fair enough, it is an artificial example. But imagine we've been collecting data from some unknown process and find that it follows some specific kind of distribution. Using this rather simple set of steps (and an armoury of appropriate distribution functions), we can approximate the distribution mathematically, arriving not only at a distribution "type" but also at (estimates of) its fundamental parameters. This is extremely powerful. If I learn that data describing the price of a commodity is normally distributed, that tells me something very different from learning that the same thing is exponentially distributed (reaching for an example here, but this should give you an idea). Discovering the parameters of a distribution allows me to run reasonably well-tuned simulations, and so make predictions about how those processes will behave in the future.

This only scratches the surface, but I hope I've made the case that this is indeed interesting, and useful.