Analysing industrial data

Juan Mateos-Garcia | June 19, 2017

In Arloesiadur, we have made extensive use of official datasets with information about economic activity by sector and location, productivity, median income and economic complexity (and uniqueness). We use these data because, despite their important limitations - such as lags in data availability and industrial classifications that do not quite reflect practice at the cutting edge of innovation (there are no blockchain companies in economic statistics) - they contain quality-assured, policy-relevant information about economic activity and its decline and growth. These are important manifestations of innovation and its impacts, which we need to measure. Official data can also help triangulate the results of other, experimental datasets, such as the tech networking and business website data we analyse elsewhere in Arloesiadur.

In this blog we describe the data sources we have used, how we processed them, and some limitations of our analysis. You can go here to look at the code we used to implement the analysis, and the processed data we created for our visualisations.

Defining industrial segments

We wanted to analyse levels of economic activity by sector and location in Wales and its local economies (defined at the level of principal areas, an official geography capturing units of local government). This analysis was informed by the concept of industrial clusters - the notion that the concentration of related industries in one place makes them more competitive and productive because they are able to draw on a ‘thicker’ talent pool, collaborate more effectively, and share ideas. To define these clusters, we had to group different industries together in an economically meaningful way.

We did this following a methodology developed by US economists Mercedes Delgado, Michael Porter and Scott Stern in a 2016 paper (PDF). In a nutshell, this involves implementing a clustering algorithm that groups together industries that are similar to each other. These industries are initially defined at a high level of granularity - four-digit Standard Industrial Classification codes, of which there are 616 in the datasets we used.

How did we define similarity? We used several metrics. Similar industries are those that:

  • tend to locate close to each other,
  • hire people in similar occupations, and
  • trade with each other.

We made these similarities operational with data from various official datasets such as the Interdepartmental Business Register (IDBR) and the Business Register Employment Survey (BRES) (to measure co-location), the Annual Population Survey (to look at the occupational composition of the workforce), and input-output tables (to analyse trade patterns).
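To make the co-location idea concrete, here is a minimal sketch of one way such a similarity could be computed: the correlation of two industries' employment across areas. This is an illustration under stated assumptions rather than the exact measure used in the analysis; the `pearson` helper, the industry names and the employment figures are all made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical employment in four areas (two coastal, two inland).
employment = {
    "fishing": [120, 10, 5, 80],
    "sea_transport": [100, 20, 10, 60],
    "software": [5, 200, 150, 10],
}

# Industries that locate in the same areas get a high co-location score.
coastal_pair = pearson(employment["fishing"], employment["sea_transport"])
mixed_pair = pearson(employment["fishing"], employment["software"])
print(round(coastal_pair, 2), round(mixed_pair, 2))
```

In this toy example the two coastal industries are strongly positively correlated, while fishing and software are negatively correlated. Similarities based on occupations or trade could be built in the same spirit from occupational shares or input-output flows.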

We benchmarked different clustering algorithms and parameters based on how good they were at generating industry segments formed by industries which are very similar to each other, and dissimilar from those in other segments. After some manual revision and cleaning, we ended up with a list of 71 clusters, which we named and classified into four aggregate industries (primary, farming, manufacturing and services).
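The benchmarking criterion described above can be sketched as a simple score: the average similarity between industries in the same segment minus the average similarity between industries in different segments. This is an assumed, simplified stand-in for whatever validation scores were actually used; `separation_score`, the industry names and the similarity values are hypothetical.

```python
from itertools import combinations


def separation_score(similarity, segments):
    """Mean within-segment similarity minus mean between-segment similarity.

    similarity: dict mapping frozenset({industry_a, industry_b}) -> similarity
    segments:   dict mapping industry -> segment label
    """
    within, between = [], []
    for a, b in combinations(sorted(segments), 2):
        value = similarity[frozenset((a, b))]
        if segments[a] == segments[b]:
            within.append(value)
        else:
            between.append(value)
    if not within or not between:
        raise ValueError("need at least one within and one between pair")
    return sum(within) / len(within) - sum(between) / len(between)


# Made-up pairwise similarities between four hypothetical industries.
similarity = {
    frozenset(("bakeries", "mills")): 0.9,
    frozenset(("software", "it_consulting")): 0.8,
    frozenset(("bakeries", "software")): 0.1,
    frozenset(("bakeries", "it_consulting")): 0.2,
    frozenset(("mills", "software")): 0.1,
    frozenset(("mills", "it_consulting")): 0.2,
}

# Two candidate groupings: one sensible, one deliberately scrambled.
sensible = {"bakeries": 0, "mills": 0, "software": 1, "it_consulting": 1}
scrambled = {"bakeries": 0, "software": 0, "mills": 1, "it_consulting": 1}

sensible_score = separation_score(similarity, sensible)
scrambled_score = separation_score(similarity, scrambled)
print(sensible_score > scrambled_score)
```

A grouping that keeps similar industries together scores higher, so candidate algorithms and parameter settings can be ranked on a score of this kind.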

Measuring sectoral performance

We were interested in understanding innovation in different industries. Unfortunately, the main dataset we could have used to do this, the UK Innovation Survey collected by the Department for Business, Energy and Industrial Strategy (BEIS), is not available at the level of resolution we needed to generate estimates of innovation for the industries we have defined.

As an alternative, we used the Annual Survey of Hours and Earnings, which collects data on median salaries in different industries nationally. The median salary of a sector offers a rough proxy for that sector’s labour productivity, which we know is correlated with its levels of innovation. We also generated sectoral metrics of GVA per worker using the Annual Business Survey (ABS), another business survey used to produce GVA estimates, but we did not incorporate them in the analysis because we were concerned about their reliability at the level of resolution we wanted to use. ABS data was not always available at the 4-digit level, or for all our industry segments, and some of the results we obtained were counter-intuitive. For example, the R&D industry segment scored very low in its GVA per worker, perhaps because some of its activities are subsidised or because some of its companies are laboratories that are part of larger organisations, which means that sales minus costs might not be a good measure of value added.

We also characterised economic sectors using an index of economic uniqueness based on the ‘method of reflections’ developed by Ricardo Hausmann and César Hidalgo. This method uses industrial clustering data to measure the economic complexity of a location (locations which are more complex tend to host a more diverse set of productive capabilities, proxied through the industries they specialise in) and the economic uniqueness of a sector (sectors which are more unique tend to be found in a smaller number of highly complex economies). We think of this index as a measure of the extent to which a sector is ‘commoditised’ (low uniqueness) or not (high uniqueness). Economically unique sectors are more likely to be based on rare combinations of capabilities, enabling the locations that specialise in them to gain market share and the strong profits that come with it.
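The core of the method of reflections can be sketched in a few lines. It starts from a binary location-by-sector matrix (1 if a location is specialised in a sector), computes each location's diversity and each sector's ubiquity, and then repeatedly averages one over the other. The sketch below is a simplified illustration on a made-up matrix; the function name, the toy data and the number of iterations are assumptions, and it does not reproduce how the final uniqueness index was derived.

```python
def method_of_reflections(presence, n_iter=1):
    """Iterate the method of reflections on a binary location x sector matrix.

    presence[c][s] is 1 if location c is specialised in sector s, else 0.
    Returns (diversity, ubiquity, location_scores, sector_scores).
    """
    n_loc, n_sec = len(presence), len(presence[0])
    # Diversity: number of sectors a location is specialised in.
    diversity = [sum(row) for row in presence]
    # Ubiquity: number of locations specialised in a sector.
    ubiquity = [sum(presence[c][s] for c in range(n_loc)) for s in range(n_sec)]
    if 0 in diversity or 0 in ubiquity:
        raise ValueError("every location and sector needs at least one 1")

    k_loc, k_sec = list(diversity), list(ubiquity)
    for _ in range(n_iter):
        # Location score: average previous score of the sectors it hosts.
        next_loc = [
            sum(presence[c][s] * k_sec[s] for s in range(n_sec)) / diversity[c]
            for c in range(n_loc)
        ]
        # Sector score: average previous score of the locations hosting it.
        next_sec = [
            sum(presence[c][s] * k_loc[c] for c in range(n_loc)) / ubiquity[s]
            for s in range(n_sec)
        ]
        k_loc, k_sec = next_loc, next_sec
    return diversity, ubiquity, k_loc, k_sec


# Toy data: three locations (rows) and three sectors (columns).
presence = [
    [1, 1, 1],  # diverse location
    [1, 1, 0],
    [1, 0, 0],  # location with a single, widespread sector
]
diversity, ubiquity, k_loc, k_sec = method_of_reflections(presence, n_iter=1)
print(ubiquity, k_sec)
```

In this toy example the third sector is present in only one location (low ubiquity), and that location is the most diverse one, which is the pattern the text associates with an economically unique sector.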

Measuring geographical performance

We produced a simple competitiveness index to measure the economic performance of an industry in a location (e.g. Wales, or one of its principal areas). This index, which also goes by other names such as ‘Revealed Comparative Advantage’ or ‘Location Quotient’, captures whether an industry is over-represented in a location compared to the UK average. Over-representation is used as a proxy for specialisation and local comparative advantages, which might be brought about by access to unique endowments (coastal areas tend to excel in Fishing and Sea transport) as well as by knowledge spillovers and innovation.
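A location quotient is the share of a sector in the local economy divided by the share of that sector in the national economy, so values above 1 indicate over-representation. The following sketch shows the standard calculation; the function name and the figures are made up for illustration.

```python
def location_quotient(local_sector, local_total, national_sector, national_total):
    """Local share of a sector divided by its national share."""
    if local_total <= 0 or national_total <= 0 or national_sector <= 0:
        raise ValueError("totals and national sector activity must be positive")
    local_share = local_sector / local_total
    national_share = national_sector / national_total
    return local_share / national_share


# Hypothetical counts: the sector is 3% of local activity and 1% nationally.
lq = location_quotient(
    local_sector=300,
    local_total=10_000,
    national_sector=20_000,
    national_total=2_000_000,
)
print(round(lq, 2))  # 3.0: the sector is three times over-represented locally
```

The same function can be applied to business counts or to employment, depending on which measure of activity is used.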

Predictive analysis

We also carried out an experimental predictive analysis to identify the industries that a local economy (i.e. a principal area) is likely to become more specialised in, based on its current industrial composition, the specialisation of neighbouring areas, and other local factors such as access to skills and local economic complexity.

Our approach was to train several machine learning models on a dataset with historical information about changes in specialisation and the predictors we mentioned, and then use those models to predict future developments. We measured changes in specialisation in terms of business counts rather than employment because we found that competitiveness indices based on the latter are noisier, particularly for smaller areas.

We trained three types of models - logistic regression, naive Bayes and random forests - on these data using three-fold cross-validation (a way of splitting the data to check that the models generalise well to new data). We found that logistic regression and naive Bayes performed better than random forests, so we averaged their estimated probabilities of specialisation gain for each location and sector. We then binned these probabilities into three categories: sectors with a probability above 0.75 were classified as ‘high probability’, sectors between 0.5 and 0.75 as ‘medium probability’, and sectors below 0.5 as ‘low probability’.
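The averaging and binning step can be sketched as follows. The thresholds come from the description above; the function names are hypothetical, and the treatment of values falling exactly on 0.5 or 0.75 (both placed in the medium category here) is an assumption, since the text does not specify it.

```python
def bin_probability(p):
    """Map a probability to the three categories described in the text."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if p > 0.75:
        return "high probability"
    if p >= 0.5:  # assumption: boundary values go to the medium category
        return "medium probability"
    return "low probability"


def ensemble_category(p_logistic, p_naive_bayes):
    """Average the two model probabilities and bin the result."""
    average = (p_logistic + p_naive_bayes) / 2
    return average, bin_probability(average)


# Hypothetical model outputs for three location/sector pairs.
examples = [(0.9, 0.8), (0.6, 0.7), (0.2, 0.4)]
categories = [ensemble_category(lr, nb)[1] for lr, nb in examples]
print(categories)
```

Reporting the category rather than the averaged number avoids suggesting more precision than the models can support.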

Some caveats

Our analysis is experimental, and we advise caution in the interpretation of findings. To begin with, we were surprised to find that many sectors in the IDBR dataset have no activity in some locations, in some cases clashing with the BRES data we use to measure employment. For example, Cardiff has 0 ‘cultural services’ businesses (according to IDBR) in 2015 and 700 people working in that same sector (according to BRES). One explanation for this is that the IDBR data we have access to, through ONS’ Nomis labour market statistics website, rounds down sectors with few observations to zero.

Another potential issue is that location quotients (the metric we use to measure competitiveness) can get noisy for small areas: where the economic base is smaller, small random disturbances in activity in one sector can have a big impact on the location quotient. We have, wherever possible, provided information about total levels of activity in the visualisations, so you can get a sense of the robustness of a location/sector competitiveness index.

The median income and economic uniqueness variables are calculated at the UK level. This does not take into account potential differences in the economic performance of sectors depending on their location, which is what we would expect to see when there are clustering effects (what economic geographers refer to as ‘agglomeration economies’). Unfortunately, the salary data is not available at a level of resolution that would allow us to capture these differences.

Finally, our predictive analysis is highly experimental, and its outputs are probabilistic. We grouped probabilities into three broad groups to avoid giving an impression of spurious accuracy or of certainty (in those cases where the model estimated that a sector had a probability of 0% or 100% of gaining specialisation in a location). It is also important to remember that the models have been trained on historical data, but innovation is by definition unpredictable. Future changes in the drivers of economic specialisation are not included in the model, so its outputs should be considered as signals or clues of what might happen, rather than set-in-stone conclusions. Bearing in mind these limitations, our hope is that the results of the model widen the space of possibilities that policymakers take into account when they consider future scenarios for their local economies, and potential interventions to drive innovation and growth.
