Tuesday, March 18, 2008

Fun with gnuplot



*Disclaimer : I'm a computer geek, not a scientist, the science is probably wrong, but I hope it's clear enough for you to draw a graph that means something*

Like the geek other half of first year biomedical students everywhere (Yeah there must be loads of em) I find i spend far too much time trying to get excel to do best fit lines on data series. And then, after all that hard work, and extracting the straight line formula, plugging in the unknown and getting the value out, what does your beloved do? Use the word drawing toolbox to hand paint a line onto the graph because excel can't do the 1 thing you really want which is a graphical representation of how you arrived at a particular value.

I'm no scientist (By a long way, I daren't even experiment on my own head) but I do know that generally, for solutions of unknown strength you can take a series of samples of known strength, run those samples through some analysis process such as spectrophotometry and then calculate your unknown using the same process. I'm also not the worlds biggest fan of office, and felt slightly let down by OpenOffice capabilities in this area too. Of course I needn't.. like all open source tasks, it's a question of right tool for the right job. instead of spending 3 weeks trying to make excel do something it can't I should have invested an hour in gnuplot. Here's the output of my lunchtime play.

I've got a file called exp_1.dat of space separated values:

1 1
2 2
3 5
4 6
5 30
6 29
7 28
8 27
11 3
22 23
25 27
26 40
27 30

The values are deliberately iffy so I can see the effects of some different operations on the graph. Specifically, the trend of values from 5 to 8 are a descending line opposing the general upward trend of the series. This is to test those occasions where you want to omit the acceleration and deceleration phase.

And here's my script.. it runs as is under linux when gnuplot is installed, i understand there is a windows runtime environment for gnuplot too.

---- Start ----
#!/usr/bin/gnuplot
# My first attempt at a plot that just fits an straight line to the entire set of points
# in the source data file. Best fit, but with just a specific sub-range of data, testing
# best fitting when excluding accelleration and decelleration phases of a reaction.

reset

# Manually force x and y range, can omit, just for clarity here
set origin 0,0
set xrange [0:40]
set yrange [0:40]

# plot of size 800x600
set terminal svg size 800,600

set output 'graph2.svg'

# Our first best fit line, for all data values
b = 0.0
m = 0.0
f(x) = m*x + b
fit f(x) 'exp_1.dat' via b, m

# Fitting only x values between 5 and 8, in practise we look at the values and decide
# where the acceleration and deceleration phases start and end
b2 = 0.0
m2 = 0.0
f2(x) = m2*x + b2
fit [5:8] f2(x) 'exp_1.dat' via b2, m2

# Here is where we'd do a bit of math to re-write y=mx+b the values for b2 and m2
# and plug in our required y value in order to arrive at
# Our test input value is 17
iy = 17
ix = ( iy - b2 ) / m2

# Some output
top_title = sprintf("Test best fit for subset outside acceleration and deceleration");
set title top_title
set timestamp "Last updated: %d/%m/%Y, %H:%M" top

t = sprintf("%1.3fx+%1.3f",m,b)
set label t at 20,16

t2 = sprintf("%1.3fx+%1.3f",m2,b2)
set label t2 at 5,13

set xlabel "My X Label"
set ylabel "My Y Label"

set key left
set grid

# Draw some nice lines from our unknown sample to the calculated absorbtion
set arrow 1 from -1,iy to ix,iy nohead
set arrow 2 from ix,iy to ix,0
# Add A pretty label
a = sprintf("Absorbtion at %f = %f",iy,ix)
set label a at ix+3, 3

# Draw it!
plot 'exp_1.dat' title "Source Data", f(x) title "Best Fit 1" , f2(x) title "Best Fit 2"

---- End ----

The net result is to the right. The script as is outputs SVG, scalable vector graphics, but for display on the blog I had to output as a gif. The SVG is (To me) much nicer looking and it has the advantage of being, well, scalable. Even better, the script is relatively generic, so any data series can be plugged in and all the user needs to do is set the unknown sample measurement in y1 and it will auto draw a line and point out the required x value. Really quite neat, and best of all, can be incorporated into an automated process.

Well I don't know if it's useful, but it beats real work.