Modern science produces a lot of data. Whether output by experimental apparatus or generated by simulation, the volume of information a scientist can produce is now often radically greater than it was even ten years ago. Scientists therefore need new tools to filter, mine and search both their own data and that produced by other researchers.
One major source of experimental data is the “supplementary data” attached to publications in journals. Our CrystalEye repository exploits this by aggregating the supplementary data from journals that publish crystal structures and converting it to CML (Chemical Markup Language).
The question then becomes how to add value to this data. One approach is to enhance its searchability and discoverability – a task for which RDF in general, and SPARQL in particular, are well suited. Using our Golem ontology language and pyGolem toolkit (which enable the layering of richer semantics onto CML), we therefore extract metadata from CrystalEye as RDF and use it to build new interfaces to the repository, making the data therein easier to find, analyse and reuse.
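As a rough illustration of the idea of turning CML metadata into RDF-style triples, here is a minimal sketch using only the Python standard library. It does not use pyGolem or a real RDF store, and it is not the CrystalEye pipeline: the `http://example.org/crystal/` namespace, the `cml_to_triples` and `match` helpers, and the sample document with its cell lengths are all hypothetical, made up for this example. The `match` function is only a stand-in for the kind of triple pattern a SPARQL query would express.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"   # CML schema namespace
EX = "http://example.org/crystal/"         # hypothetical namespace for this sketch

# Made-up sample document; the values are illustrative, not real data.
SAMPLE_CML = """<molecule xmlns="http://www.xml-cml.org/schema" id="demo1">
  <formula concise="C 6 H 6"/>
  <crystal>
    <scalar dictRef="iucr:_cell_length_a">7.44</scalar>
    <scalar dictRef="iucr:_cell_length_b">9.55</scalar>
  </crystal>
</molecule>"""


def cml_to_triples(cml_text):
    """Extract (subject, predicate, object) triples from a CML string."""
    root = ET.fromstring(cml_text)
    subject = EX + root.get("id")
    triples = []

    # The concise formula becomes a plain string-valued property.
    formula = root.find(f"{{{CML_NS}}}formula")
    if formula is not None:
        triples.append((subject, EX + "formula", formula.get("concise")))

    # Each scalar becomes a numeric property named after its dictRef,
    # e.g. "iucr:_cell_length_a" -> "cell_length_a".
    for scalar in root.iter(f"{{{CML_NS}}}scalar"):
        name = scalar.get("dictRef", "").split(":")[-1].lstrip("_")
        triples.append((subject, EX + name, float(scalar.text)))
    return triples


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


TRIPLES = cml_to_triples(SAMPLE_CML)

if __name__ == "__main__":
    # Find every structure that has a recorded cell length a.
    for subj, pred, obj in match(TRIPLES, p=EX + "cell_length_a"):
        print(subj, pred, obj)
```

In a real deployment the triples would be loaded into an RDF store and queried with SPARQL; the wildcard pattern above is only meant to show why metadata in triple form is easy to search.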
I’m a researcher in the Unilever Centre, part of the Chemical Laboratory at the University of Cambridge, where I build systems and languages for representing and mining large volumes of chemical data. Before that, I worked in theoretical chemical physics, designing algorithms for predicting diffusion and chemical reactions by atomistic simulation.
I’m part of the MaterialsGrid project, within which my major research interest is the Golem ontology language/toolkit. I blog at Brighten the Corners.