How we used research data in Arloesiadur

Juan Mateos-Garcia | June 1, 2017

Research in universities can generate economic benefits through a variety of pathways: new knowledge created at universities is applied by businesses, public sector bodies and not-for-profit organisations; highly skilled graduates go to work in industry; and experts provide advice to people outside of academia. But realising these benefits requires strong networks between university and industry. In Arloesiadur, we analyse the research landscape - the domains where research is taking place and the collaboration networks around it - using an open dataset about UK Research Council funded projects, the Gateway to Research (GtR).

In this blog, we give an overview of our analysis and its limitations. You can check out the code and access the processed data here. We have written up some of this in a paper we presented at the Data for Policy 2017 conference. Download it here.

Why the Gateway to Research?

There are many datasets one could use to analyse research activity: publication data, patents and information about access to European research funding through the CORDIS dataset, to name a few. We decided to go for the Gateway to Research because:

  • It is very timely, including information about research projects that just got funded but have not yet generated any outputs,
  • It contains information about projects from all disciplines, including types of knowledge that are rarely embodied in patents or even papers but can be very important for innovation in some sectors, as is the case with Arts and Humanities,
  • It has information about collaborations between university and industry, allowing us to map research networks beyond academia.

This is not to say that those other datasets are not useful. We are actively considering ways to incorporate them into future analyses.

Data collection and cleaning

The Gateway to Research data is available through an open Application Programming Interface (API) with a variety of endpoints. Through these, we downloaded information about projects, organisations and funders. These datasets contained many variables that we were interested in, such as the projects that had been funded and their topics (in the project dataset), the organisations that had participated in projects and their location (in the organisation dataset) and the funding awarded to projects (in the funder dataset). There were other datasets of interest which we did not analyse this time, such as individual researcher information and research outputs. We will look for opportunities to do this in the future.
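As a rough illustration, the sketch below shows how one might page through the GtR API with Python's requests library. The base URL, pagination parameters and JSON keys are assumptions for illustration rather than the exact calls we used - the live API documentation is the authority here.

```python
"""Rough sketch of paging through the Gateway to Research API.
The base URL, pagination parameters and JSON keys below are assumptions
for illustration - check the current GtR API documentation before use."""
import requests

BASE = "https://gtr.ukri.org/gtr/api"     # assumed base URL
HEADERS = {"Accept": "application/json"}  # request JSON rather than the default XML


def fetch_all(endpoint, page_size=100, max_pages=1000):
    """Collect every record from a paginated endpoint such as 'projects'."""
    records = []
    for page in range(1, max_pages + 1):
        resp = requests.get(f"{BASE}/{endpoint}",
                            params={"p": page, "s": page_size},  # assumed pagination params
                            headers=HEADERS)
        resp.raise_for_status()
        batch = resp.json().get(endpoint.rstrip("s"), [])  # e.g. the 'project' key (assumed)
        if not batch:
            break
        records.extend(batch)
    return records


# projects = fetch_all("projects")
# organisations = fetch_all("organisations")
# funds = fetch_all("funds")
```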

We worked with a dataset of 72,592 projects. One of our main interests was to monitor levels of activity in different research areas in Wales. This would allow us to map Wales’ research specialisations against the sectors identified in the Welsh Government’s Science strategy, and to identify the research capabilities in different locations and organisations. This led us to exclude from the analysis those projects that did not have any research subject information or abstract - things such as Studentships, Knowledge Transfer Partnerships or projects supported by Innovate UK - which took us down to 33,373 projects (90% of which are research grants).

We had to follow a rather complicated strategy to classify projects into research areas and research topics. Initially, we used the tags (e.g. ‘microeconomics’, ‘robotics’, ‘materials’) given to projects by funders to draw a network of research activity, where tags that tended to appear in the same projects were linked to each other, and we then used community detection methods to look for tightly knit ‘tag communities’ in that network.
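A minimal sketch of this step, assuming each project is represented by its list of funder tags, might look like the following (using networkx and the python-louvain package; this is illustrative rather than our production code).

```python
"""Sketch of the tag co-occurrence network and community detection step."""
from itertools import combinations

import community as community_louvain  # python-louvain package
import networkx as nx

# Toy input: one list of funder tags per project (assumed structure)
project_tags = [
    ["robotics", "materials", "manufacturing"],
    ["microeconomics", "labour markets"],
    ["robotics", "machine learning"],
]

G = nx.Graph()
for tags in project_tags:
    for a, b in combinations(sorted(set(tags)), 2):
        # edge weight = number of projects in which the two tags co-occur
        weight = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
        G.add_edge(a, b, weight=weight)

# Louvain community detection: tags that tend to co-occur end up in the same
# community, and each community is read off as a candidate research area
partition = community_louvain.best_partition(G, weight="weight")
```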

Through this analysis, we identified a list of 7 quite intuitive research areas (Arts and Humanities, Engineering and Technology, Environmental Sciences, Life Sciences, Mathematics and Computing, Physics and Social Sciences) that mapped well against the research funding councils (AHRC, EPSRC - which primarily funds projects in both Engineering and Technology and Mathematics and Computing - NERC, BBSRC, STFC and ESRC). We classified each project into the research area for which it had the most tags; if there was a tie at the top, we allocated it to one of its top areas at random. When we analysed levels of activity over time, we found a couple of interesting things:

  1. 99.8% of the projects in the data had a start date of 2006 or later, consistent with the idea that GtR primarily covers research funded in the last 10 years or so.
  2. The research tags we had been relying on to classify projects into communities are used inconsistently over time and across research domains: the Biotechnology and Biological Sciences Research Council (BBSRC) only started tagging its projects in 2011, and the Medical Research Council (MRC) never used tags, relying instead on ‘health categories’. In total, 5,962 projects funded by the MRC lacked tags, as did 4,040 projects funded by the BBSRC, and as many as 1,046 EPSRC projects were untagged as well. To address this, we trained a supervised machine learning model on the dataset of (generally more recent) projects we had managed to label through the community detection approach, using the text of their abstracts and their funders as predictors (MRC-funded projects were simply labelled as ‘Medical Science’). We then predicted the disciplines of the unlabelled projects with this model, allocating each project to the discipline with the highest estimated probability, except where that probability was below 0.3 (those we kept unlabelled). By the end of this process we had gone down from 6,721 unlabelled projects to 565. A sketch of this classification step appears after this list.
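The sketch below illustrates this kind of supervised step, assuming two hypothetical DataFrames, `labelled` and `unlabelled`, with 'abstract', 'funder' and (for the former) 'discipline' columns; the TF-IDF plus logistic regression pipeline is one plausible choice, not necessarily the exact model we used.

```python
"""Sketch of the discipline classifier for untagged projects.
`labelled` and `unlabelled` are assumed pandas DataFrames; column names and
the TF-IDF + logistic regression pipeline are illustrative choices."""
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Combine abstract text and funder name into a single text feature
X_train = labelled["abstract"] + " " + labelled["funder"]
y_train = labelled["discipline"]

model = make_pipeline(
    TfidfVectorizer(stop_words="english", max_features=20000),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# Predict disciplines for untagged projects, keeping only predictions
# with a probability of at least 0.3 (the rest stay unlabelled)
X_new = unlabelled["abstract"] + " " + unlabelled["funder"]
probs = model.predict_proba(X_new)
best = probs.argmax(axis=1)
predicted = np.where(probs.max(axis=1) >= 0.3, model.classes_[best], None)
```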

Generating higher resolution research topics

We wanted to dig below the 8 research areas we had identified in our analysis, but doing this was not easy. An initial analysis of the whole corpus using a topic modelling algorithm (Latent Dirichlet Allocation, or LDA), which identifies clusters of terms that appear in the same documents and measures the relative importance of these topics in each document of a corpus, generated very noisy results. A visual inspection suggested that the algorithm was getting confused by the heterogeneity of language used in different research disciplines. To address this, we trained an LDA model inside each discipline, extracting 200 topics. The results were much more intuitive.
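A sketch of this per-discipline topic modelling, assuming a hypothetical `abstracts_by_discipline` dictionary mapping each discipline to its project abstracts; scikit-learn's LDA implementation is used here for illustration, though any implementation would do.

```python
"""Sketch of fitting one 200-topic LDA model per discipline.
`abstracts_by_discipline` is an assumed dict mapping a discipline name to a
list of project abstracts."""
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

models = {}
for discipline, abstracts in abstracts_by_discipline.items():
    vectoriser = CountVectorizer(stop_words="english", max_features=10000)
    dtm = vectoriser.fit_transform(abstracts)          # document-term matrix
    lda = LatentDirichletAllocation(n_components=200, random_state=0)
    lda.fit(dtm)
    models[discipline] = (vectoriser, lda)             # keep both to score new text later
```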

We then predicted the topic distribution for each project with these models. Acknowledging the possibility that a project might have topics from several disciplines (by definition, if it is interdisciplinary), we applied the models for all disciplines to all projects, but weighted the probability of a discipline's topic in a project by the probability that the project belonged to that discipline in the first place (based on the supervised models we had trained when cleaning the data).
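The weighting can be sketched as follows, assuming the per-discipline `models` from the previous sketch and a hypothetical `discipline_probs` mapping giving each discipline's probability for the project (from the supervised classifier).

```python
"""Sketch of the weighted topic vector for a single project.
`models` comes from the previous sketch; `discipline_probs` is an assumed dict
of discipline -> probability from the supervised classifier."""
import numpy as np


def project_topic_vector(abstract, models, discipline_probs):
    parts = []
    for discipline, (vectoriser, lda) in models.items():
        topics = lda.transform(vectoriser.transform([abstract]))[0]  # 200 topic weights
        # down-weight topics from disciplines the project is unlikely to belong to
        parts.append(topics * discipline_probs.get(discipline, 0.0))
    return np.concatenate(parts)  # ~200 topics x number of disciplines
```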

This gave us, for each project, a vector with around 1,600 values representing its weights over 200 topics for each of 8 disciplines. This data had high resolution: just as an example, it included topics such as “bee, colony, pollinator, landscape, crop, specie, honeybee, bumblebee”, “theory, string, quantum particle, physic, black hole, gravity”, “graphene, plastic, flexible sheet, tube, printed, substrate, layer” and “manufacturing process, fabrication, printing, additive, technique, precision, material”, which capture highly specific research topics of interest to policymakers. At the same time, it was difficult to report so many topics, and we were concerned that the data would be noisy.

To simplify things, we produced a topic network inside each discipline based on the Jaccard distance between topics (whether or not they appeared in the same projects), and once again performed a community detection analysis to identify clusters of topics, resulting in a final set of 88 research topics that we report in the visualisations.
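A sketch of the topic network construction, assuming a hypothetical `topic_projects` mapping from each topic to the set of projects where it is present; here the edge weight is the Jaccard similarity (the distance is simply one minus this).

```python
"""Sketch of the within-discipline topic network.
`topic_projects` is an assumed dict mapping each topic to the set of project
ids in which that topic is present."""
from itertools import combinations

import networkx as nx


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

T = nx.Graph()
for t1, t2 in combinations(topic_projects, 2):
    similarity = jaccard(topic_projects[t1], topic_projects[t2])
    if similarity > 0:
        T.add_edge(t1, t2, weight=similarity)
# community detection over T (as in the tag network) groups topics into clusters
```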

Reporting local specialisations

Since we had geo-coded all organisations in the GtR data, it was relatively easy to ‘bin’ projects into regions and nations (i.e. Wales) and principal areas. But how should we attribute projects to research topics? We opted for reporting slightly different things depending on the visualisation.

  • For the trend chart, we classified each project under its top research topic, and reported the number of funded projects and the total funding raised by projects led by organisations based in the area. This eliminates the risk of double counting.
  • In the heatmap, we were more interested in representing the ‘research capabilities’ present in a location, so we classified projects by their top 3 research topics, and counted any funded project with participation from organisations in the location (regardless of whether they led the project or not). This means that there will be some double counting where projects contain more than one topic or involve more than one organisation. A short sketch of both counting rules follows this list.
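The two counting rules can be sketched roughly as follows, assuming a hypothetical DataFrame `df` with one row per project-organisation pair and illustrative column names ('project_id', 'location', 'lead', 'funding', 'top_topic', 'top3_topics').

```python
"""Sketch of the two counting rules. `df` is an assumed pandas DataFrame with
one row per project-organisation pair; all column names are illustrative."""
import pandas as pd

# Trend chart: lead organisations only, one (top) topic per project - no double counting
leads = df[df["lead"]].drop_duplicates("project_id")
trend = leads.groupby(["location", "top_topic"]).agg(
    projects=("project_id", "count"),
    funding=("funding", "sum"),
)

# Heatmap: any participating organisation, top 3 topics per project
heat = (df.explode("top3_topics")
          .groupby(["location", "top3_topics"])["project_id"]
          .nunique())
```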

The recommendation engine

Finally, we were interested in identifying opportunities for collaboration between different organisations in Wales. We decided to represent this information using a ‘recommendation engine’. To build this,  we created a ‘base map’ capturing actual research collaborations between Wales-based organisations.

We then had to find as-yet unfulfilled opportunities for collaboration. Our logic was not that different from what Amazon or Netflix do: in the same way that consumers might be interested in products purchased by others similar to them, we assumed that organisations might be interested in collaborating with the collaborators of organisations similar to themselves.

Although the basic notion was simple, implementation was anything but. We began by calculating a research specialisation profile for each organisation based on the research topics of the projects it participated in (we tagged projects with their 5 most important topics). We then calculated similarities between organisations using the cosine distance between their specialisation profiles, and identified, for each organisation, the 10 most similar to it.
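A sketch of the similarity step, assuming a hypothetical `profiles` matrix of organisations by research topics and a matching `org_names` list.

```python
"""Sketch of finding each organisation's 10 most similar peers.
`profiles` is an assumed organisations-by-topics matrix of specialisation
weights; `org_names` is the matching list of organisation names."""
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(profiles)
np.fill_diagonal(similarity, 0)  # ignore self-similarity

most_similar = {
    org_names[i]: [org_names[j] for j in np.argsort(similarity[i])[::-1][:10]]
    for i in range(len(org_names))
}
```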

We then took each organisation (alter) and looked for the organisation (ego) that was most similar to it, extracting ego’s top research collaborators and filtering out those of ego’s collaborators that had never participated in a research area the alter was active in, as well as those organisations already collaborating with alter. We used this information to create an alternative ‘opportunity network’, where every node (organisation) has a maximum of 5 connections with the top collaborators of the organisations most similar to it.
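Putting the pieces together, the opportunity network might be built roughly as follows, assuming `most_similar` from the previous sketch plus hypothetical `collaborators` (a Counter of past collaborators per organisation) and `areas` (the set of research areas each organisation is active in).

```python
"""Sketch of the 'opportunity network'. `most_similar` comes from the previous
sketch; `collaborators` (organisation -> Counter of past collaborators) and
`areas` (organisation -> set of research areas) are assumed inputs."""
from collections import Counter

import networkx as nx

opportunities = nx.Graph()
for alter, similar_orgs in most_similar.items():
    ego = similar_orgs[0]  # the organisation most similar to alter
    candidates = [
        org for org, _ in collaborators.get(ego, Counter()).most_common()
        if org != alter
        and areas.get(org, set()) & areas.get(alter, set())  # shares a research area
        and org not in collaborators.get(alter, Counter())   # not already a collaborator
    ]
    for org in candidates[:5]:  # at most 5 suggested connections per organisation
        opportunities.add_edge(alter, org)
```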
