Juan Mateos-Garcia | June 1, 2017
Research in universities can generate economic benefits through a variety of pathways: new knowledge created at universities is applied by businesses, public sector bodies and not-for-profit organisations; highly skilled graduates go to work in industry; and experts provide advice to people outside academia. But realising these benefits requires strong networks between university and industry. In Arloesiadur, we analyse the research landscape - the domains where research is taking place, and the collaboration networks around them - using an open dataset about UK Research Council funded projects, the Gateway to Research (GtR).
In this blog, we give an overview of our analysis and its limitations. You can check out the code and access the processed data here. We have written up some of this analysis in a paper we presented at the Data for Policy 2017 conference. Download it here.
There are many datasets one could use to analyse research activity: there are publication data, patents and information about access to European research funding through the Cordis dataset, just to name a few. We decided to go for Gateway to Research because:
This is not to say that those other datasets are not useful: we are actively considering ways to incorporate them into future analyses.
The Gateway to Research data is available through an open Application Programming Interface (API) with a variety of endpoints. Through these, we downloaded information about projects, organisations and funders. These datasets contained many variables that we were interested in, such as the projects that had been funded and their topics (in the project dataset), the organisations that had participated in projects and their location (in the organisation dataset) and the funding awarded to projects (in the funder dataset). There were other datasets of interest which we did not analyse this time, such as individual researcher information and research outputs. We will look for opportunities to do this in the future.
We worked with a dataset of 72,592 projects. One of our main interests was to monitor levels of activity in different research areas in Wales. This would allow us to map Wales’ research specialisations against the sectors identified in the Welsh Government’s Science strategy, and to identify the research capabilities of different locations and organisations. This led us to exclude from the analysis those projects without any research subject information or abstract - things such as Studentships, Knowledge Transfer Partnerships and projects supported by Innovate UK - taking us down to 33,373 projects (90% of which are research grants).
We had to follow a rather complicated strategy to classify projects into research areas and research topics. Initially, we used tags (e.g. ‘microeconomics’, ‘robotics’, ‘materials’) given to projects by funders to draw a network of research activity (where the tags that tended to appear in the same projects were linked to each other), and we then used community detection methods to look for tightly knit ‘tag communities’ in that network.
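The tag-network step can be sketched as follows. The tags and projects here are made up, and we use networkx with greedy modularity maximisation as a stand-in for whichever community detection method is applied; the structure of the computation is what matters.

```python
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# made-up funder tags for four projects
project_tags = [
    ["robotics", "control systems", "sensors"],
    ["robotics", "sensors", "machine learning"],
    ["microeconomics", "labour markets"],
    ["microeconomics", "labour markets", "econometrics"],
]

# link tags that appear on the same project, weighting by co-occurrence
G = nx.Graph()
for tags in project_tags:
    for a, b in combinations(sorted(set(tags)), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

# tightly knit 'tag communities' via greedy modularity maximisation
communities = greedy_modularity_communities(G, weight="weight")
```

On this toy graph the engineering-flavoured tags and the economics-flavoured tags fall into separate communities, which is exactly the kind of grouping we then read off as research areas.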
Through this analysis, we identified a list of 7 quite intuitive research areas (Arts and Humanities, Engineering and Technology, Environmental Sciences, Life Sciences, Mathematics and Computing, Physics and Social Sciences) that mapped well against the research funding councils (AHRC, EPSRC - primarily funding projects in Engineering and Technology and in Mathematics and Computing - NERC, BBSRC, STFC and ESRC). We assigned each project to the research area for which it had the most tags; in the event of a tie, we allocated the project to one of its top areas at random. When we analysed levels of activity over time, we found a couple of interesting things:
We wanted to dig below the 8 research areas we had identified in our analysis, but doing this was not easy. An initial analysis of the whole corpus with a topic modelling algorithm (Latent Dirichlet Allocation, or LDA) - which identifies clusters of terms that appear in the same documents, and measures the relative importance of these topics in each document of a corpus - generated very noisy results. A visual inspection suggested that the algorithm was getting confused by the heterogeneity of language used in different research disciplines. To address this, we trained an LDA model inside each discipline, extracting 200 topics each. The results were much more intuitive.
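A minimal version of the per-discipline topic modelling, using scikit-learn's LDA implementation (we are not claiming this is the exact library used; the toy abstracts and the 2-topic setting are purely illustrative stand-ins for one discipline's corpus and its 200 topics):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# toy abstracts standing in for one discipline's corpus
abstracts = [
    "graphene layers printed on a flexible plastic substrate",
    "printed graphene sheets for flexible electronics",
    "bee colony decline across pollinator landscapes",
    "crop pollinators and honeybee colony health",
]

# bag-of-words counts per abstract
vectoriser = CountVectorizer(stop_words="english")
counts = vectoriser.fit_transform(abstracts)

# 2 topics for the toy corpus (200 per discipline in the real analysis)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic mixture per abstract
```

Fitting one such model per discipline is what keeps, say, economics vocabulary from polluting materials-science topics.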
We then predicted the topic distribution for each project with these models. Acknowledging that a project might draw on topics from several disciplines (by definition, if it is interdisciplinary), we applied the models of all disciplines to all projects, but weighted the probability of a discipline’s topic in a project by the probability that the project belonged to that discipline in the first place (based on the supervised models we had trained when cleaning the data).
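The weighting step amounts to scaling and concatenating vectors. A sketch with two toy disciplines (the names, shapes and probabilities are illustrative; the real pipeline concatenated 8 disciplines of 200 topics each):

```python
import numpy as np

# per-discipline topic mixtures for one project
topic_vectors = {
    "engineering": np.array([0.7, 0.2, 0.1]),
    "biology":     np.array([0.1, 0.1, 0.8]),
}
# probability that the project belongs to each discipline
discipline_probs = {"engineering": 0.9, "biology": 0.1}

# scale each discipline's topic mixture by the discipline probability,
# then concatenate into a single long project vector
project_vector = np.concatenate(
    [discipline_probs[d] * topic_vectors[d] for d in sorted(topic_vectors)]
)
```

Because both the topic mixtures and the discipline probabilities sum to one, the concatenated vector still sums to one, so projects remain directly comparable.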
This gave us, for each project, a vector of around 1,600 values: its weights across 200 topics in each of 8 disciplines. This data had high resolution - it included topics such as “bee, colony, pollinator, landscape, crop, specie, honeybee, bumblebee”, “theory, string, quantum particle, physic, black hole, gravity”, “graphene, plastic, flexible sheet, tube, printed, substrate, layer” and “manufacturing process, fabrication, printing, additive, technique, precision, material”, which capture highly specific research topics of interest to policymakers - but reporting so many topics was difficult, and we were concerned that the data would be noisy.
To simplify things, we produced a topic network inside each discipline based on the Jaccard distance between topics (how often pairs of topics did or did not appear in the same projects), and once again performed a community detection analysis to identify clusters of topics, resulting in a final set of 88 research topics that we report in the visualisations.
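The topic-to-topic similarity underlying that network can be computed directly from the sets of projects each topic appears in. A minimal sketch with made-up topics and project ids (Jaccard distance is simply one minus the similarity below):

```python
def jaccard(a, b):
    """Jaccard similarity between two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# which projects (by id) each topic was tagged on -- made-up data
topic_projects = {
    "graphene":  {101, 102, 103},
    "printing":  {102, 103, 104},
    "honeybees": {201, 202},
}

# topics sharing many projects sit close together in the topic network
sim = jaccard(topic_projects["graphene"], topic_projects["printing"])
```

Edges weighted this way feed the same community detection machinery used on the tag network, yielding the 88 clusters.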
Since we had geo-coded all organisations in the GtR data, it was relatively easy to ‘bin’ projects into regions and nations (i.e. Wales) and principal areas. But how could we classify projects into research topics? We opted for reporting slightly different things depending on the visualisation.
Finally, we were interested in identifying opportunities for collaboration between different organisations in Wales. We decided to represent this information using a ‘recommendation engine’. To build this, we created a ‘base map’ capturing actual research collaborations between Wales-based organisations.
We then had to find as-yet unfulfilled opportunities for collaboration. Our logic was not that different from what Amazon or Netflix do: just as consumers might be interested in products purchased by people similar to them, we assumed that organisations might be interested in collaborating with the collaborators of organisations similar to them.
Although the basic notion was simple, the implementation was anything but. We began by calculating a research specialisation profile for each organisation based on the research topics of the projects it participated in (we tagged projects with their 5 most important topics). We then calculated similarities between organisations using the cosine distance between their specialisation profiles, and identified, for each organisation, the 10 most similar to it.
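A sketch of the similarity step, with made-up specialisation profiles (topic counts per organisation over 4 topics; in the real analysis each profile spans the 88 topics and we keep the 10 nearest organisations):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two specialisation profiles."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# made-up profiles: how often each topic appears in an organisation's projects
profiles = {
    "org_a": np.array([5.0, 1.0, 0.0, 0.0]),
    "org_b": np.array([4.0, 2.0, 0.0, 0.0]),
    "org_c": np.array([0.0, 0.0, 3.0, 6.0]),
}

def most_similar(org, k=10):
    """Rank the other organisations by profile similarity to `org`."""
    scored = sorted(
        ((cosine(profiles[org], v), name)
         for name, v in profiles.items() if name != org),
        reverse=True,
    )
    return [name for _, name in scored[:k]]
```

Cosine similarity compares the shape of two profiles rather than their size, so a small institute and a large university with the same research mix come out as close neighbours.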
We then took each organisation (ego) and looked for the organisation (alter) most similar to it, extracting alter’s top research collaborators and filtering out those collaborators that had never participated in a research area where ego was active, as well as those already collaborating with ego. We used this information to create an alternative ‘opportunity network’, where every node (organisation) has a maximum of 5 connections with the top collaborators of the organisations most similar to it.
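The filtering logic above can be sketched like this (all organisation names, and the choice to represent collaborations and active research areas as plain sets, are illustrative assumptions, not our production data structures):

```python
# the observed 'base map' of collaborations (made-up)
collaborations = {
    "org_a": {"org_x"},
    "org_b": {"org_x", "org_y", "org_z"},
}
nearest_peer = {"org_a": "org_b"}       # from the similarity step
active_areas = {
    "org_a": {"Engineering"}, "org_x": {"Engineering"},
    "org_y": {"Engineering"}, "org_z": {"Arts"},
}

def recommend(ego, k=5):
    """Suggest up to k new collaborators for ego."""
    peer = nearest_peer[ego]
    # start from the peer's collaborators, dropping ego itself and
    # anyone ego already works with
    candidates = collaborations[peer] - collaborations[ego] - {ego}
    # keep only organisations active in one of ego's research areas
    candidates = {c for c in candidates if active_areas[c] & active_areas[ego]}
    return sorted(candidates)[:k]
```

Running this for every organisation, and keeping at most 5 suggestions each, yields the edges of the opportunity network.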