A brief explanation of technical terms used in the analysis


Amazon Web Services (AWS)

Amazon’s platform that offers scalable cloud computing services.

Artificial Neural Network

A family of machine learning models built from interconnected artificial neurons, used to identify patterns in data.

Bag-of-words (BoW)

BoW simplifies the representation of a sentence by treating it as an unordered collection of words. BoW models disregard grammar and word order, but keep the number of times each word occurs.
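
As a minimal sketch of the idea (the function name `bag_of_words` and the lower-casing and whitespace splitting are our own simplifying assumptions, not the preprocessing used in the analysis), a bag-of-words can be built by counting words:

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    """Count how often each lower-cased word occurs, ignoring word order."""
    return Counter(sentence.lower().split())

a = bag_of_words("the cat sat on the mat")
b = bag_of_words("on the mat the cat sat")

print(a["the"])  # 2: the multiplicity of each word is kept
print(a == b)    # True: word order is disregarded
```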

Bayesian Information Criterion (BIC)

A model selection criterion that takes the log-likelihood of a model and penalises it for the number of parameters that have to be estimated.
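
A short sketch of the usual textbook form, BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters, n the number of observations and ln(L) the maximised log-likelihood. The function name `bic` and the numbers are illustrative assumptions. Under this convention, lower values indicate a better trade-off between fit and complexity.

```python
import math

def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L)."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood

# Two hypothetical models with the same fit but different complexity:
simple = bic(log_likelihood=-100.0, n_params=3, n_samples=1000)
complex_ = bic(log_likelihood=-100.0, n_params=5, n_samples=1000)

print(simple < complex_)  # True: extra parameters are penalised
```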

Comparative advantage

An economic or innovative actor has a comparative advantage over others in an activity (e.g. an industry or research topic) if it can generate outputs more efficiently. In Arloesiadur, we proxy this using competitiveness indices, which consider how strongly represented an area or sector is in a location compared to the UK average. If the representation is stronger than the UK average (that is, the competitiveness index is greater than 1), we assume that the location is competitive in it.
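
The exact formula behind the competitiveness indices is not given here. One common way to express "representation compared to the UK average" is a location quotient: the share of a sector in a location divided by its share in the UK. The sketch below assumes that form; the function name and all figures are hypothetical.

```python
def competitiveness_index(local_sector: float, local_total: float,
                          uk_sector: float, uk_total: float) -> float:
    """Share of a sector in a location divided by its share in the UK."""
    local_share = local_sector / local_total
    uk_share = uk_sector / uk_total
    return local_share / uk_share

# Hypothetical counts: 50 of 1,000 local businesses are in the sector,
# compared with 20,000 of 1,000,000 businesses across the UK.
index = competitiveness_index(50, 1_000, 20_000, 1_000_000)

print(index > 1)  # True: the sector is over-represented in the location
```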

Cosine similarity

A measure of similarity between two non-zero vectors of an inner product space, defined as the cosine of the angle between them.
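
A minimal sketch for vectors stored as plain Python lists (the function name is ours): the dot product of the two vectors is divided by the product of their lengths.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)

# Vectors pointing in the same direction score close to 1,
# perpendicular vectors score 0.
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])
perpendicular = cosine_similarity([1, 0], [0, 1])
```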

Degree distribution

The probability distribution of node degrees across all of a network’s nodes: for each possible degree, the fraction of nodes that have it.
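
A small sketch for an undirected network given as a list of edges. The function names and the toy network are assumptions for illustration, and nodes without any edge are not represented in this simplified form.

```python
from collections import Counter

def node_degrees(edges):
    """Number of connections of each node in an undirected edge list."""
    degrees = Counter()
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return degrees

def degree_distribution(edges):
    """Fraction of nodes that have each observed degree."""
    degrees = node_degrees(edges)
    counts = Counter(degrees.values())
    n_nodes = len(degrees)
    return {k: c / n_nodes for k, c in sorted(counts.items())}

edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
print(degree_distribution(edges))  # {1: 0.25, 2: 0.5, 3: 0.25}
```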

Economic uniqueness

A sector’s economic uniqueness reflects its propensity to appear in a small number of economically complex locations in the UK, based on geographical clustering data. This value reflects the extent to which the sector requires a diverse range of local capabilities to be competitive, and is therefore hard for other locations to imitate.

Gaussian Mixture Model (GMM)

GMM is used to identify the subpopulations within a larger population. It assumes that the data points were generated from a mixture of Gaussian distributions with unknown parameters, which are estimated using the Expectation-Maximisation (EM) algorithm.


We have segmented industries at a high level of resolution (4-digit Standard Industrial Classification codes) into 76 detailed industries based on their propensity to locate in the same places, recruit people in the same occupations, and trade with each other. We then aggregated these 76 detailed industries into 4 aggregate sectors.

Infrastructure-as-a-Service (IaaS)

A category of cloud computing that provides virtualised computing resources over the Internet.

Latent Dirichlet Allocation (LDA)

A generative model that can find the latent topics contained in a collection of documents, where each topic is a distribution over words.

Level of funding

The value in pounds of a research project funded by UK research councils according to Gateway to Research data. It is important to note that funding data is only available at the project level, rather than the organisation level. This means that whenever we mention funding in our research visualisations, we are referring to the funding awarded to projects led by or involving Welsh organisations, rather than the amount of funding going to Wales.

Levenshtein Distance

Measures the difference between two strings. It is the minimum number of single-character insertions, deletions and substitutions that have to be made in order to change one word into another.
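
A compact sketch of the standard dynamic-programming approach, keeping only one row of the edit table at a time (the function name is ours):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```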

Maximum spanning tree

A spanning tree of a weighted network in which the edges with the highest weights are kept, so that the total edge weight is as large as possible.
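
One way to build such a tree is Kruskal's algorithm with the edges visited from heaviest to lightest. The sketch below assumes a connected, undirected network; the function name and the toy weights are illustrative and not taken from the analysis.

```python
def maximum_spanning_tree(nodes, edges):
    """Kruskal's algorithm, visiting edges in order of decreasing weight.

    edges is a list of (node_u, node_v, weight) tuples.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the representative of n's component.
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    tree = []
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # keep the edge only if it joins two components
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree

nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 0.9), ("b", "c", 0.4), ("a", "c", 0.7),
         ("c", "d", 0.5), ("b", "d", 0.2)]
print(maximum_spanning_tree(nodes, edges))
# [('a', 'b', 0.9), ('a', 'c', 0.7), ('c', 'd', 0.5)]
```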

Median Salary

The median salary earned by people working in that industry, based on data from the Annual Survey of Hours and Earnings.

Meetup group

A group created on Meetup that usually has a specific theme and organises events. Users can join Meetup groups, participate in these events and join discussions.

Meetup tag

Tags characterise the key activities of a Meetup group. Every Meetup group organiser has to provide some tags that describe the group.


N-gram

A contiguous sequence of N elements (for example words or characters) from a given piece of text.
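
Such sequences are commonly called n-grams. A minimal sketch that extracts them from a list of words (the function name and the example text are ours):

```python
def ngrams(tokens, n):
    """All runs of n consecutive elements, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "open data science meetup".split()
print(ngrams(words, 2))
# [('open', 'data'), ('data', 'science'), ('science', 'meetup')]
```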

Node degree

The number of connections that a node has.

Paragraph Vector (doc2vec)

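An extension of word2vec that learns vectors not only for words but also for the tags attached to a piece of text (such as a paragraph or document label), so that tags can be related to words as well as words to words.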

Platform-as-a-Service (PaaS)

A category of cloud computing that lets users create and run applications without the cost of building and maintaining the underlying infrastructure.

Predictive analysis

We have used historical data on business activity by sector and principal area, together with data on educational levels and economic complexity, to train a model which predicts the probability that a principal area will gain business specialisation in an industry. The model produces a probability between 0 and 1 for each industry. We have classified sectors with a probability above 0.75 in a ‘high probability’ category, sectors with a probability between 0.5 and 0.75 in a ‘medium probability’ category, and sectors below 0.5 in a ‘low probability’ category.
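
The banding of predicted probabilities can be sketched as follows. The function name is hypothetical, and the treatment of values exactly on a threshold is an assumption, since the text only specifies "above" and "below".

```python
def probability_category(p: float) -> str:
    """Map a predicted probability to the band used in the visualisations."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a probability must lie between 0 and 1")
    if p > 0.75:
        return "high probability"
    if p > 0.5:
        return "medium probability"
    # Assumption: a value of exactly 0.5 falls in the lowest band.
    return "low probability"

print(probability_category(0.8))  # high probability
print(probability_category(0.6))  # medium probability
print(probability_category(0.3))  # low probability
```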

Principal area

Our sub-national analysis focuses on 22 principal areas, an official geography that reflects local government structures in Wales.

Regular expression

A sequence of symbols and characters that defines a pattern to be searched for in a piece of text.
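
For illustration, the sketch below uses Python's built-in `re` module to find every standalone four-digit number in a made-up sentence. The pattern and the text are examples of ours, not ones used in the analysis.

```python
import re

# \b marks a word boundary and \d{4} matches exactly four digits.
four_digit_code = re.compile(r"\b\d{4}\b")

text = "Codes such as 6201 and 7219 have four digits, unlike 62."
print(four_digit_code.findall(text))  # ['6201', '7219']
```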

Research area

The broad discipline into which we classify a research project, based on a predictive model that takes into account the text of its abstract and the research council that funded it.

Research topic

A detailed domain of research inside a research area, which we identify using natural language processing and community detection methods.

Silhouette score

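A measure used to validate the consistency and evaluate the quality of the groups created by a clustering technique. It compares how close each data point is to the other points in its own cluster with how close it is to the points in the nearest other cluster.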
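
A plain-Python sketch of the usual definition: for each point, a is the mean distance to the other points in its cluster, b is the lowest mean distance to the points of any other cluster, and the point's score is (b - a) / max(a, b). The function name and the toy data are assumptions; the sketch needs at least two clusters and `math.dist` requires Python 3.8 or later.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-point clusters
            continue
        # Mean distance to the other members of the same cluster.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # Lowest mean distance to the members of any other cluster.
        b = min(sum(math.dist(point, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # close to 1: tight, well separated
poor = silhouette_score(points, [0, 1, 0, 1])  # negative: points grouped badly
```

Scores near 1 suggest compact, well-separated clusters, while scores near or below 0 suggest overlapping or misassigned points.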

Software-as-a-Service (SaaS)

A category of cloud computing that provides software to users on a subscription basis.

Spanning tree

A subgraph of an undirected network that connects all the nodes of the original graph using the minimum number of edges.

Stop words

Common terms that do not provide any insight into the content of a sentence.
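
A minimal sketch of stop word removal. The short stop word list and the function name are illustrative assumptions; real lists are much longer and depend on the language and the task.

```python
# A deliberately tiny, hypothetical stop word list.
STOP_WORDS = {"the", "a", "of", "and", "in", "to"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("The future of data in Wales".split()))
# ['future', 'data', 'Wales']
```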

t-distributed stochastic neighbor embedding (t-SNE)

A non-linear dimensionality reduction algorithm that is particularly good at preserving which points are close to each other when mapping them from a high-dimensional space to two dimensions.

Tech topic

A topic is a collection of Meetup tags that are found in similar contexts or are used to describe the same thing. In our analysis, we have created a hierarchy of topics. Initially, 42 sub-categories were created, which were then aggregated into 9 broad topics.

Term frequency, inverse document frequency (TF-IDF)

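TF-IDF is a weight used to evaluate the importance of a word in a document that belongs to a corpus. The first part, TF, is the raw frequency of the word in the document. The second part, IDF, reduces the weight of words that are found in many documents across the corpus, so that words which are frequent in one document but rare elsewhere score highest.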
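
There are several variants of this weight. The sketch below assumes one simple form, the raw count of a word in a document multiplied by ln(N / df), where N is the number of documents and df the number of documents containing the word. The function name and the toy corpus are ours.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """TF-IDF weights for each tokenised document in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each word once per document

    weights = []
    for doc in corpus:
        term_freq = Counter(doc)
        weights.append({word: count * math.log(n_docs / doc_freq[word])
                        for word, count in term_freq.items()})
    return weights

corpus = [["python", "data", "data"],
          ["python", "cloud"],
          ["python", "design"]]
weights = tf_idf(corpus)

print(weights[0]["python"])  # 0.0: the word appears in every document
print(weights[0]["data"] > weights[1]["cloud"])  # True: "data" occurs twice
```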


Token

Tokens are the meaningful entities of minimum length that are created after splitting a piece of text into its parts. For instance, phrases, symbols, words, numbers and punctuation marks are some commonly created tokens. In Arloesiadur, tokens are the preprocessed tags that are used by the Meetup groups.
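
As a generic illustration (this is not the preprocessing applied to the Meetup tags, which is not described here), a piece of text can be split into word, number and punctuation tokens with a regular expression. The function name and the example text are ours.

```python
import re

def tokenize(text: str):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Data-science meetups, 2016!"))
# ['Data', '-', 'science', 'meetups', ',', '2016', '!']
```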


Word2vec

A shallow neural network that transforms words into vectors. Its goal is to predict a word given its context words. As a result, it can capture the linguistic relationships between words and measure the similarity between them.
