Matplotlib cookbook

Let me show you how to plot histograms suitable for high energy physics using the Python package matplotlib. I use it to draw figures for publications and such. It's really very nifty, and you can make just about anything. I recently read (full disclosure: the publisher sent it to me, asking for a review) a cookbook for making plots with matplotlib, this guy here by Alexandre Devert. First question you might want to ask is the following: Given that matplotlib is free software, with a quite extensive base of very introductory tutorials, which I can get for free online, why would I pay money for a pdf consisting basically of tutorials? Well, I guess if you don't know anything about Pyhton or matplotlib to begin with -- say you are an interested high school student or just started university and want to make an extra effort -- then I guess this book could be suited for you.
The book basically goes through a number of recipe for plotting various figures, very similar to the ones found in the matplotlib doc. I'm not gonna go on about it here, I would rather show you how I use matplotlib to plot nice histograms for high energy physics.
So, matplotlib is easy. Almost too easy. If you want to plot a histogram, documentation (and the cookbook) tells you to do something like this (in Python): import numpy
import matplotlib.pyplot as plt

values = numpy.random.randn(1000)
plt.hist(values)
plt.savefig("hist.jpg")
to get something like this figure: The simple histogram
Very easy. Almost too easy. Problem is: this is not how I, nor anyone I know, works. I use histograms a lot. Basically any experiment you can do in high energy physics involves counting, and histograms are the way to visualize this. I cannot work with big, rectangular bins, I need errorbars, and I need to be able to put several histograms in one plot and actually be able to view the contents. I would also need access to bin contents along the way, to produce eg. the ratio of multiple histograms. I am fully aware that the philosophy of matplotlib is to make the easy very easy, but I might argue that the example above is a bit too easy to be useful. Luckily I can do what I want, and I will walk you through the process.
import numpy
import matplotlib.pyplot as plt

# Give me the same values as before
values = numpy.random.randn(1000)

# I use the numpy histogram as it is easier to get its contents
# In real life application I use a dedicated histogramming package
# Second argument is my choice of binning
_bins, _edges = numpy.histogram(values,numpy.arange(-5,5,0.1))

plt.plot(_edges[:len(_edges)-1],_bins)

plt.savefig("hist2.jpg")
which would give the figure below:
The less simple histogram
This is not way more complicated, but very neccesary to provide the tools for what we're going to do now. We could provide errorbars. Only problem is that the builtin function errorbar() has the same problems as hist(). So I draw my own, simply by drawing a vertical line. I also want to emphasize that this is binned data without drawing very wide bins. I do that by marking bin centers with a small o. All in all it's just a matter of adding a function to calculate the error, and then drawing it:
import numpy
import matplotlib.pyplot as plt

def error(bins,edges):
# Just estimate the error as the sqrt of the count
err = [numpy.sqrt(x) for x in bins]
errmin = []
errmax = []
for x,err in zip(bins,err):
errmin.append(x-err/2)
errmax.append(x+err/2)
return errmin, errmax

# Give me the same values as before
values = numpy.random.randn(1000)

# I use the numpy histogram as it is easier to get its contents
# In real life application I use a dedicated histogramming package
# Second argument is my choice of binning
_bins, _edges = numpy.histogram(values,numpy.arange(-5,5,0.5))

bins = _bins
edges = _edges[:len(_edges)-1]
errmin, errmax = error(bins,edges)
# Plot as colored o's, connected
plt.plot(edges,bins,'o-')
# Plot the error bars
plt.vlines(edges,errmin,errmax)

plt.savefig("hist3.jpg")
which would give something like this:
Histogram with errors
I know we are somewhat beyond introductory now, but let me proceed, as this is stuff I would have loved someone to tell me at some point. Now let me add another histogram, and plot both, along with a ratio of the two. This is also something I use all the time, as a ratio plot is excellent for showing the difference between eg. data and simulation. I have also added a legend and an axis label such that it is visible how to do something like that. Happy plotting.
import numpy
import matplotlib.pyplot as plt

def error(bins,edges):
# Just estimate the error as the sqrt of the count
err = [numpy.sqrt(x) for x in bins]
errmin = []
errmax = []
for x,err in zip(bins,err):
errmin.append(x-err/2)
errmax.append(x+err/2)
return errmin, errmax

def getRatio(bin1,bin2):
# Sanity check
if len(bin1) != len(bin2):
print "Cannot make ratio!"
bins = []
for b1,b2 in zip(bin1,bin2):
if b1==0 and b2==0:
bins.append(1.)
elif b2==0:
bins.append(0.)
else:
bins.append(float(b1)/float(b2))
# The ratio can of course be expanded with eg. error
return bins

# Give me the same values as before
values = numpy.random.randn(10000)

# Also values for another histogram
values2 = numpy.random.randn(10000)

# I use the numpy histogram as it is easier to get its contents
# In real life application I use a dedicated histogramming package
# Second argument is my choice of binning
_bins, _edges = numpy.histogram(values,numpy.arange(-5,5,0.5))
_bins2, _edges2 = numpy.histogram(values2,numpy.arange(-5,5,0.5))

bins = _bins
edges = _edges[:len(_edges)-1]
bins2 = _bins2
edges2 = _edges2[:len(_edges2)-1]

errmin, errmax = error(bins,edges)
errmin2, errmax2 = error(bins2,edges2)

fig = plt.figure()
ax = fig.add_subplot(2,1,1)

# Plot as colored o's, connected
ax.plot(edges,bins,'o-',color="blue",lw=2,label="Blue hist")
ax.plot(edges2,bins2,'o-',color="red",lw=2,label="Red hist")
leg = ax.legend()
# Plot the error bars
ax.vlines(edges,errmin,errmax)
ax.vlines(edges2,errmin2,errmax2)

# Plot the ratio
ax = fig.add_subplot(2,1,2)
rat = getRatio(bins,bins2)
ax.plot(edges,rat)
ax.set_ylim(0,2)
ax.set_xlabel("Variable name")

plt.savefig("hist4.jpg")

Histogram full showoff

Machine Learning Contest

I have come across a contest that looks really exciting. The people over at Kaggle are hosting a programming contest about particle physics. Kaggle is a rather interesting platform, which hosts contests involving "big data". You know, the much celebrated buzzword. The idea behind the contests is, that all participants is given a large histotic dataset or a simulation, and based on that, using any technique they want, they are supposed to predict future behaviour. Say for example that an insurance company would like to know what kinds of extra insurance they could sell to existing customers. They would supply the Kaggle contest with historic sales data, and based on that, contestants could for example find that they should focus on selling life insurance to a certain subgroup of people, to maximize the value from their effort. The means to do such an analysis, is often various machine learning techniques. Luckily these can also be used to solve problems that are actually interesting... The problem that needs to be solved is rooted in the fact that even though the aim of the large particle physics experiments, like ATLAS , is to produce exotic particles such as the much celebrated Higgs boson, such particles are never seen directly in the detector, as they decay even before they leave the innermost part of the detector; the pipe in which the proton beams collide. All the physicists at the experiments can do is to try and calculate "back" from the stuff the Higgs bosons decay to. The efficiency of such calculations is of course an important matter, as it will determine how well the Higgs bosons can be observed. The contest at kaggle wants participants to employ machine learning techniques to do this calculation for a certain type of Higgs decays. As this is something I have done before in one of my courses, albeit for another decay channel, this is of course something I want to to, probably over the summer. My idea is to take this opportunity to learn another of the free, open source frameworks for image analysis techniques, perhaps OpenCV. My idea is to use traditional techniqes like boosted decision trees, but with the caveat that I dont want to just train them against a signal background sample, but instead train against the different types of background vs. the signal. I suspect that this would allow me to achieve better separation, as different processes would populate phase space differently. If you think this sounds interesting and would like to know more, you should head over to Kaggle. If you already know something, especially about more exotic machine learning techniques, and want to form a team, feel free to contact me!

Open election data

After the 2014 election for the European parliament, I did a fun, small project with open election data. Turns out that the Danish national office for statistics makes data about all polling stations freely available to the public though their website (largely in Danish). The data are basically all personal and party votes from all polling stations. Since it is possible to view historical data as well, it is possible to correlate with results from earlier elections, and see if voters have moved. I wrote two simple Python functions to a) parse a url for election information using an xml library and b) use that parsed information to build a tree of votes. The first one is pretty simple: import xml.etree.ElementTree as et import urllib def getroot(url): name = url.split('/') elecname = name[-3] filename = name[-1] if os.path.isdir(elecname): try: f = open(elecname + "/" + filename) f.close() except: urllib.urlretrieve(url,elecname + "/" + filename) else: os.mkdir(elecname) urllib.urlretrieve(url,elecname + "/" + filename) f = open(elecname + "/" + filename) tree = et.parse(f) f.close() return tree.getroot() As dst puts all information about elections in separate files for separate polling stations, all info written in an xml file in their webpage, it makes sense to just parse that information and build a tree from it. Try to call the above script with such an xlm file as the argument, like: ft09 = "http://www.dst.dk/valg/Valg1204271/xml/fintal.xml" tree = getroot(url) Now the election data can be extracted on a district level using the following function. (I apologize for the fact that the code is partly written in Danish. This is of course due to the fact that everything from dst.dk is in Danish.) def kredsniveau(plist, pshort, root): kredse = [] for child in root: if child.tag == "Opstillingskredse": for sub in child: k = Kreds() k.name = sub.text.strip('1234567890.').strip() kredsroot = getroot(sub.get("filnavn")) for ch in kredsroot: if ch.tag == "Status": k.status = int(ch.get("Kode")) elif ch.tag == "Stemmeberettigede": k.voters = int(ch.text) elif ch.tag == "Stemmer": for party in ch: if party.get("Bogstav") is not None: let = party.get("Bogstav") if let == u'\xd8': let = "E" setattr(k, let + "votes",party.get("StemmerAntal")) elif party.tag == "IAltGyldigeStemmer": k.validvotes = int(party.text) elif party.tag == "IAltUgyldigeStemmer": k.invalidvotes = int(party.text) elif ch.tag == "Personer": for party in ch: for person in party: for p,q in zip(plist,pshort): if person.get("Navn") == p: setattr(k,q,int(person.get("PersonligeStemmer"))) kredse.append(k) return kredse And now the fun begins, as we can loop through the districts (kredse) and extract data. For example to compare votes on the Socialistic peoples party in the local election for parliament in 11 to the EP election in 14, and plot it using matplotlib: ep14 = "http://www.dst.dk/valg/Valg1475795/xml/fintal.xml" root = getroot(ep14) ep14kredse = kredsniveau(plist,pshort,root) ft09 = "http://www.dst.dk/valg/Valg1204271/xml/fintal.xml" elist = [] eshort = [] root = getroot(ft09) ft09kredse = kredsniveau(elist,eshort,root) x = [] y = [] label = [] xnorm = 0 ynorm = 0 for ep,ft in zip(ep14kredse,ft09kredse): if ep.status ==12: try: xval = float(ft.Fvotes)/float(ft.validvotes) yval = float(ep.Fvotes)/float(ep.validvotes) x.append(xval) y.append(yval) label.append(ep.name) except: print "Problem at ",ep.name import numpy import matplotlib.pyplot as plt plt.figure() plt.subplot(111) ax = plt.scatter(x, y) plt.xlabel("SF (FT09) (%)") plt.ylabel("SF (EP14) (%)") plt.show() Adding a bit of flavour to the above by writing out names of outliers produces a figure like this: SF correlation plot
So SF clearly did a better job at the EP election than at the last election for local parliament (FT).The district of Lolland is funny since SF did significantly worse for EP than for FT. My guess would be that the votes there were stolen by the EU sceptics from the Danish Nationalist Party. Looking at personal votes, the top candidates from the two EU-sceptical parties are fun to correlate. Morten Messerschmidt is from the Danish Nationalist Party and Rina Ronja Kari is from the center-left Peoples Movement against EU. Rina/Morten correlation plot
The correlation plot clearly shows that even though Kari and Messerschmidt are both EU sceptical, they get votes from different people. It clearly shows that Rina is comparably stronger in places that traditionally have strong support of left wing parties. You are more than welcome to download and play with the whole thing from here.

Using Raspberry Pi as a git server

I use git for version control of everything. It is so damn convenient. For every project I have, I simply maintain a git repository sitting on my Raspberry Pi, which this website is also hosted from. This is a small howto explaining how to make such a setup.
You need an internet connection and a small computer running some flavour of Linux. A Raspberry Pi will do. If the machine you use for development is a private machine, I would recommend setting up password-less ssh login. As git works over ssh, this will allow you to do git operations without providing a password.
To setup ssh without password, such that you can login from user@dev to user@pi do: user@dev~> ssh-keygen -t rsa Just click ok on all the prompts, do not provide a password. You have now generated a key. This needs to be copied to the Pi, in the .ssh directory. If such a directory does not already exists (it usually does), do: user@dev~> ssh pi "mkdir -p .ssh" Enter your password when prompted. Now just copy your key by doing eg.: user@dev~> scp .ssh/id_rsa.pub pi:.ssh/authorized_keys You need to enter your password again. This operation will overwrite existing authorized_keys on your pi. If you want to keep those, you need to append your new key to the end of the file instead. Then do: user@dev~> cat .ssh/id_rsa.pub | ssh pi "cat >> .ssh/authorized_keys" You of course have to enter the password. Now you can do ssh to pi without password. We will now setup a new project to be version controlled with git. Go the folder on dev which holds your project, and setup a new git repo: user@dev~> cd my_project/ user@dev~> git init If you have never used a version control system like git before, here is a quick rundown. You can do all the changes you want on the local machine without affecting anything else. This is particularly nice if you are working with other people, as you do not need to worry about what other people are doing that might affect the stuff you are trying to do right now. Once you are satisfied with your work, you add the files you changed, and commit your work. In this small example that would amount to: user@dev~> git add . user@dev~> git commit -m "Initial commit" Now everything in the currect directory is committed, and we want to setup the remote repository on the Pi: user@dev~> ssh pi user@pi~> git init --bare my_project.git user@pi~> exit All that is left now is to push the commit on the local machine to the Pi (the remote). You need to specify where the remote is: user@dev~> git remote add origin pi:my_project.git The first time you push the remote needs to be specified (in the future you just need to do git push): user@dev~> git push -u origin master Now you have a git repository setup for this project. You can clone it from other machines by doing: user@dev2~> git clone pi:my_project.git and all the other stuff you want to do with version control.
There are plenty of options for free hosting of git repositories. The github is probably both the largest and the most well known. Using such external hosting options comes with both benefits and drawbacks. I am currently trying to pull all the various services I use to servers I have physical access to. This is why my git repos are hosted on a local machine, as well as this website. Besides from the obvious advantages -- no provider has direct acces to my data, and I, and I alone, am to blame for missing backups and other faults -- I have great fun in learning how such services works under the hood.

Summerschool in Zakopane

I went to a really nice summer school on high energy physics in Zakopane (Cracow School of Theoretical Physics 2014). Zakopane is situated in the Tatra mountains of southern Poland, a place I will seriously consider returning to for holliday. Look at the beautiful scenery from the mountain just outside the school! View from the school
Luckily the school was not mostly about the view -- it was about understanding collisions of protons at very high energy and collisions of heavy atomic nuclei. When I say very high energy, that is a statement with some ambiguity. Since protons are constituent objects, consisting of quarks and gluons, you don't really collide protons; you collide the constituents. This means that two different energies (or 'scales' as we would say) enters the problem. We have the full proton energy, and then the fraction of proton energy going into the collision of constituents, the so-called 'hard collision'. To enable a calculation of the hard collision, one needs to have some kind of model for how the proton is populated with quarks and gluons. It turns out that this is very dependent on the fraction of energy taken out for the hard collision. When this fraction goes very small, the amount of gluons rise dramatically, and the proton gets 'overpopulated' with gluons. A lot of different theories about this overpopulation exists, but a particularly popular variant of this predicts that the proton will simply saturate with gluons at one point. I am working with such a model myself, and so far we have not experimentally seen clear indications of such saturation effects. But as the energy of the experiment rises, more and more interesting collisions for observing saturation effects will be made, and so we are very interested in understanding our models better before the LHC starts making collisions again in 2015.
I made a presentation about my own work, which was fairly well recieved. If you want to look at the slides, you can find them here. View from the school
I also had time to play a little bit of music. At the conference dinner a very nice group of Polish folk singers played the traditionals of the region, and I got to be invited on stage to play a couple of songs with them on a borrowed guitar. Above is a (bit grainy) picture of the folk singer group in the traditional outfit of the region.

Statistical physics

The course on statistical physics covers both the basics of thermodynamics, introduces statistical physics, and then goes on to introduce some advanced methods for analysing the statistical physics of phase transition. The book is Statistical Mechanics by Chandler. The course features two rather large handin exercises. You can find my attempt at a solution here and . You can get my (handwritten) exam notes I think they can provide a pretty good idea about the contents of the course.

Quantum field theory

The course on quantum field theory is a standard first-half of Peskin and Schroeder kind of course, much emphasis on calculations, and a lot of work done on hand-in exercises. This is a need-to-have course if you want to do anything related to particle physics, theoretical or experimental. I have a lot of lecture notes (handwritten), which you can get in a tarball here and of course some handin exercises which you can get in another tarball .

Pedagogy

Ph.D students, who are teaching, must go through an introductory pedagogy course. Even though I have had such courses before, I thought it was a very interesting experience. Not least because the course was both international and interdisciplinary, so everybody had a nice opportunity to meet with graduate students from diffent fields and cultures. Turns out that every Ph.D student teaching at a university level, pretty much faces the same difficulties, regardless of subject. The course ended with all participants writing a short treatise on a pedagogical issue they are facing themselves, and possible solutions. I wrote about using kinesthetic (that means something you do with your body) exercises in physics teaching. It was quite fun to write, and I think it actually turned out ok. You can read it here.

Monte Carlo and Molecular Dynamics

This is a course from the COMPUTE research school. It covers the basics of Monte Carlo and Molecular Dynamics simulation techniques, and then goes on to use the methods on five different real world cases (to be worked on in exercises), presented by an expert from each field. In the end, a project covering one of the areas in more rigorous detail was made.

Electrodynamics

The LU course on electrodynamics is an obligatory read-it-yourself-course, which basically consists of reading parts of the classic treatment by Jackson, and solving exercises. Oh the exercises. I am currently taking this course, and I will upload my course material, and a bit more of a lucid description one it's over with.

Colours and Flavours

The course Colours and Flavours is part of the LU two-course treatise on phenomenology of the Standard Model, the other being a course on Monte Carlo Generators. The course starts with a rundown of the Standard Model, goes through mixing in the quark sector (CKM) and the neutrino sector. The bulk of the course treats various aspects of low energy QCD in various approximation schemes. The course ends with a rundown of SUSY.
The course home page. Still contains various info and a litterature list. The course book was Standard Model by Donnagie et al. Not recommendable as a first book on particle theory, but very good to give an overview of quite a lot of subjects within SM physics and various EFT.

Computer vision

The course on computer vision was a special course within the framework of the COMPUTE research school. The aim was to give graduate students from very diverse fields an introduction to computer vision theory and technology, and provide means to enable students to use these tools in own research. It was very interesting to meet and greet with other people from various fields. I especially recall a guy who made his Ph.D taking pictures of flora and fauna around the world, using stationary camaras, which were setup to take a picture a couple of times a day. He would then use computer vision methods to decide state of vegetation, time of year and so on, automatically from the pictures. I did a project where I separated Higgs particles from a very noisy background (the b-bbar channel), using various multivariate methods. I was very surprised how good the results were, and if anyone reading this would like to look at boosted Higgs particles in the b-bbar channel, they are very welcome to get my final report here or to look at my presentation . Multivariate methods are very fascinating to me, and probably one of the neccesary ingredients to make real AI, artificial intelligence. It will definitely be something I will pursue further at some point in the future.

A course Classical mechanics

I finished the LU course on classical mechinacs. Yay. The Lund University course on advanced classical mechanics, goes through the Lagrangian and Hamiltonian formalism, covering all of(!) the tome of this subject by Goldstein et al., which you can read more about in my books section. There is not much interesting to say about such a course, but I did manage to do a bunch of the Goldstein exercises, which you can download (handwritten!) herebooks section. There is not much interesting to say about such a course, but I did manage to do a bunch of the Goldstein exercises, which you can download (handwritten!) , and .

Triple gauge boson couplings

I wrote my Master's thesis within the ATLAS experiment on the subject of Triple Gauge Boson Couplings (in short, TGCs). The TGC model is an extension of the Standard Model of Particle Physics, to include processes which are not otherwise allowed. Extending the Standard Model in this way, allows for searches of new physics, without having to refer to a specific theory such as technicolor, supersymmetry or the like.
If you want to know more, you are welcome to go ahead and read my Master's thesis.