Using AI and web data to understand the drivers of productivity.

glass.ai
Jul 16, 2019

The West Yorkshire Combined Authority (WYCA) wished to explore how open web data and machine learning techniques could enhance official business data to help understand the drivers of productivity at companies. Funding for the project came via the BEIS Business Basics Programme, which is part of the Industrial Strategy, and delivered in partnership with Innovate UK and the Innovation Growth Lab at Nesta.

The Glass.ai engine has read and mapped a very large part of the UK economy based on its web presence. It regularly reads the websites of 1.5 million UK businesses (200M web pages) across sectors and geographies. It reads and structures all the text on the websites. Working with WYCA, a sample of this data was used to investigate indicators of high productivity companies.

Data

WYCA supplied a list of 3,491 companies in the WYCA region. Of these companies, 2,929 had web addresses assigned so were candidates for inclusion in the analysis. WYCA also supplied 2,856 companies with productivity data.

The list was compared with the Glass.ai dataset and we were able to match 2,491 companies with the data. The companies missing from the matched set were due to duplicates and dead sites. Duplicates included groupings of related companies, for example group company entities, holding entities and operational company entities.
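As an illustration only, matching on web address could look something like the sketch below. The function names, the sample companies and the rule of treating a shared domain as a duplicate are assumptions made for this example, not a description of the Glass.ai matching process.

```python
from urllib.parse import urlparse

def normalise_domain(url: str) -> str:
    """Reduce a web address to a bare lowercase host for comparison."""
    url = url.strip().lower()
    if "://" not in url:
        url = "http://" + url  # urlparse only finds the host when a scheme is present
    host = urlparse(url).netloc.split(":")[0]
    return host[4:] if host.startswith("www.") else host

def match_companies(supplied, crawled_domains):
    """Split supplied (name, url) pairs into matched, duplicate and unmatched."""
    matched, duplicates, unmatched = {}, [], []
    for name, url in supplied:
        domain = normalise_domain(url)
        if domain not in crawled_domains:
            unmatched.append(name)   # e.g. a dead site
        elif domain in matched:
            duplicates.append(name)  # e.g. group and holding entities sharing one site
        else:
            matched[domain] = name
    return matched, duplicates, unmatched

# Hypothetical sample data
supplied = [
    ("Acme Ltd", "https://www.acme.example/"),
    ("Acme Holdings Ltd", "acme.example"),
    ("Gone Ltd", "http://gone.example"),
]
matched, duplicates, unmatched = match_companies(supplied, {"acme.example"})
```

Here the two Acme entities resolve to the same domain, so only the first is kept as a match and the second is reported as a duplicate.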

The data collected from the web includes descriptions, social media accounts, addresses and emails collected from the organisation website, as well as counts of related entities (e.g. news, people) found on the site. From social media and the open web, there are also web presence indicators. The descriptions and other text found on the organisation websites were also used to predict the sector(s) that the company operates in and the main topics related to the company’s activities.

The full topic list was used to investigate relationships to indicators of high productivity.

Analysis

We used the data collected from the web to find indicators of high productivity. For this analysis, we used those companies in Glass.ai that had corresponding productivity data from WYCA, a total of 1,360 companies. It is worth highlighting that, although there may be a correlation between the indicators and productivity, this does not imply causality.

To perform this analysis we identified the set of high productivity and low productivity companies based on the ranked values from WYCA. We first looked at whether there was a bias in the productivity sets based on the broad industry sectors provided by WYCA.

Average productivity rank by industry sector

This chart shows that organisations are not evenly distributed over the productivity ranking across sectors. For example, non-profit organisations often appear with low productivity. To remove this imbalance we selected the top and bottom 10 organisations by rank in each sector to form the sets of high and low productivity companies. This ensured that particular industry characteristics, rather than general organisation characteristics, wouldn’t dominate the chosen sets of organisations. Each set, high and low productivity companies, contained 140 organisations.
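One reading of this selection step can be sketched as follows. It assumes that the top and bottom organisations were taken within each sector and that rank 1 is the most productive; both are assumptions made for the example, as are the function name and the sample data.

```python
from collections import defaultdict

def split_by_sector(companies, n=10):
    """Return (high, low): the n best and n worst ranked companies in each sector.

    companies: iterable of (name, sector, rank); rank 1 is assumed most productive.
    """
    by_sector = defaultdict(list)
    for name, sector, rank in companies:
        by_sector[sector].append((rank, name))
    high, low = [], []
    for ranked in by_sector.values():
        ranked.sort()
        if len(ranked) < 2 * n:
            continue  # too few companies to fill both sets without overlap
        high.extend(name for _, name in ranked[:n])
        low.extend(name for _, name in ranked[-n:])
    return high, low

# Hypothetical sample data, using n=1 to keep it short
companies = [
    ("A1", "Retail", 1), ("A2", "Retail", 5), ("A3", "Retail", 9), ("A4", "Retail", 12),
    ("B1", "Charity", 7), ("B2", "Charity", 10), ("B3", "Charity", 11), ("B4", "Charity", 14),
]
high, low = split_by_sector(companies, n=1)
```

Because every sector contributes the same number of organisations to each set, a sector that sits low in the overall ranking cannot fill the low productivity set on its own.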

Using these groupings we examined the various characteristics collected from the web. The following charts show the ratio of the share of results in the high (blue) and low (orange) productivity groups of companies. For example, these show a positive correlation between published news and high productivity, but a negative correlation between LinkedIn references and high productivity.
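A minimal sketch of the share comparison, assuming each company is represented as a dictionary of boolean characteristics. The field names `news` and `linkedin` and the sample values are hypothetical.

```python
def share(group, feature):
    """Fraction of companies in a group for which the characteristic is present."""
    return sum(1 for company in group if company.get(feature)) / len(group)

def compare_shares(high, low, features):
    """Return {feature: (share in high set, share in low set)}."""
    return {f: (share(high, f), share(low, f)) for f in features}

# Hypothetical sample data
high = [{"news": True, "linkedin": False}, {"news": True, "linkedin": True}]
low = [{"news": False, "linkedin": True}, {"news": True, "linkedin": True}]
shares = compare_shares(high, low, ["news", "linkedin"])
```

A characteristic whose share is larger in the high set than in the low set would show as mostly blue in a chart of this kind, and the reverse as mostly orange.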

Share of high and low productivity companies by characteristic

There were specific themes that WYCA also wanted to explore in the data. The areas under investigation were Export, Innovation, Awards, Patents and Certification. For each of these, we mapped the areas onto sets of topics.

  • Export: import, export, trade, import taxes, import duties, imported goods, direct imports, international business, Chinese market, Russian market, French market, Japanese market, Australian market, American market, German market, European market, foreign markets, export cargo, export-import, export markets, import-export trade, export licensing, export products, international market, overseas markets, global export, international trade, major world markets, select international markets, key international markets, major international markets, Latin American markets, Italian market, global expansion plans, international expansion plan, global growth, international growth, international expansion, global expansion, European expansion
  • Innovation: technological change, digital transformation, digital revolution, technological revolution, technological evolution, new emerging technologies, emerging technologies, disruptive innovation, innovation, product innovation, process innovation, industrial innovation, innovation system, innovation management, open innovation, research development
  • Awards: numerous accolades, prestigious industry awards, numerous industry awards, multiple awards, numerous awards, award nominations, national award, annual awards, industry awards, special award, awards
  • Patents: patent, patents, patented, patent pending
  • Certification: ISO, BSI

These groupings were then counted in the high and low productivity sets of companies to highlight any correlation with productivity. The counts show a positive relationship between export and innovation and high productivity but, interestingly, a relationship between patents and low productivity. This may be to do with the cost of patent development or the characteristics of the industries in which these were found.
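The counting step could be sketched roughly as below. The two theme term sets are taken from the lists above (lowercased); the matching rule, where a company counts once per theme if any of its topics is in the theme's term set, is an assumption, and the sample companies are hypothetical.

```python
THEMES = {
    "Patents": {"patent", "patents", "patented", "patent pending"},
    "Certification": {"iso", "bsi"},
}

def theme_counts(company_topics, themes=THEMES):
    """Count the companies whose topics include at least one term of each theme."""
    counts = {theme: 0 for theme in themes}
    for topics in company_topics:
        lowered = {topic.lower() for topic in topics}
        for theme, terms in themes.items():
            if lowered & terms:  # any overlap counts the company once
                counts[theme] += 1
    return counts

# Hypothetical topic sets for three companies in one productivity set
high_topics = [{"ISO", "export"}, {"patent pending"}, {"training"}]
counts = theme_counts(high_topics)
```

Running the same function over the low productivity set gives the second number needed to compare the two shares for each theme.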

Share of high and low productivity companies by topic

The final part of the analysis examined all the topics that appeared in the high and low productivity sets of companies and clustered them into groupings.

The clustering used was a form of hierarchical agglomerative clustering based on a bottom-up approach. At first, each keyword is treated as a single-entity cluster. Then, iteratively, clusters are merged in pairs based on an ad-hoc metric for measuring the semantic similarity between the keywords in the two clusters. The process continues until the similarity between any two clusters is lower than a given threshold. The remaining clusters are considered to be semantically coherent sets of keywords.
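The bottom-up procedure can be sketched as follows. The semantic similarity metric actually used is not specified here, so this example swaps in a simple word-overlap (Jaccard) score averaged over keyword pairs; that stand-in metric, the threshold value and the function names are all assumptions made for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two keywords (a stand-in metric)."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)

def cluster_similarity(c1, c2):
    """Average pairwise similarity between the keywords of two clusters."""
    return sum(jaccard(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerate(keywords, threshold=0.3):
    """Merge the most similar pair of clusters until none reaches the threshold."""
    clusters = [[k] for k in keywords]  # start with one cluster per keyword
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cluster_similarity(clusters[i], clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break  # every remaining pair is below the threshold
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

clusters = agglomerate(["wind", "wind farms", "solar farm", "solar project", "training"])
```

With this toy metric, "wind" and "wind farms" merge first, then "solar farm" and "solar project", and "training" is left on its own. A real semantic metric would also recognise that "wind farms" and "solar farm" are related, which word overlap cannot.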

Using this method, 73 clusters of topics were identified. As an example, the two most semantically coherent clusters were as follows.

  • cluster_0: auditing,business ethics,continual improvement process,corporate governance,corporate social responsibility,environmental health,environmental impact,environmental impact assessment,environmental management system,environmental resource management,environmental responsibility,good practice,governance,health safety,impact assessment,information systems,integrated management system,management facilities,management systems,procurement,product development,professional development,project delivery,project management,quality assurance,quality control,quality management,quality systems,resource management,risk assessment,risk management framework,safety,safety management,safety management systems,strategic management,strategic planning,strategy,successful delivery,supply chain,sustainability,sustainability performance,sustainable development,transformation programmes
  • cluster_1: carbon,carbon emissions,district heating,district heating schemes,energy,energy consumption,energy efficiency,energy industry,energy scheme,energy wastage,fire protection,fire safety,generating capacity,heat network,heating,heating system,meter reading,onshore wind,renewable energy,renewable energy companies,renewable generation,smart meter,solar farm,solar project,sustainable energy,thermal efficiency,turbines,wind,wind farms

For each cluster, we then compared the cluster’s share in the high and low productivity sets to determine which clusters were better indicators of high productivity. We looked at the individual topics and the clusters as a whole. This led to highlighting the following sets of topics as high productivity indicators.

  • fleet operations,fuel efficiency,fuel usage,tracking,tracking solutions,tracking system,tracking tools,vehicle security,vehicle tracking solutions,vehicle tracking system
  • European patent,intellectual property,patent attorneys,patent law,patent offices,regulatory agencies,regulatory compliance,regulatory expertise,regulatory strategy,trademark
  • Europe, global, globes, world
  • carbon emissions,district heating,district heating schemes,energy,energy consumption,energy efficiency,energy industry,energy scheme,energy wastage,fire protection,fire safety,generating capacity,heat network,heating,heating system,meter reading,onshore wind,renewable energy,renewable energy companies,renewable generation,smart meter,solar farm,solar project,sustainable energy,thermal efficiency,turbines,wind,wind farms
  • capital,cash,credit,credit risk,equity funds,financial intermediary,funds,investee companies,investment,investment opportunities,loan fund,loans,venture capital,working capital,working capital loans
  • city, village
  • leaders, leadership, pioneers
  • oil,power,transportation,water
  • legislation,member state,policy,regulation,rules,use policy
  • business partner,business results,competitive advantage,customer experience,customer focus,customer relationship management,customer satisfaction,customer service,flexible approach,forward thinking,high quality facilities,higher level,operating cost,operational efficiency,partner companies,personal service,product performance,product quality,professional team,proven track record,quality customer,senior management,service level,staff members,teams,technical excellence,track record
  • community, organizations, partnerships, works
  • qualifications, training
  • creative business, entrepreneurs, new businesses
  • business ethics,continual improvement process,corporate governance,corporate social responsibility,environmental health,environmental impact,environmental impact assessment,environmental management system,environmental resource management,environmental responsibility,good practice,governance,health safety,impact assessment,information systems,integrated management system,management facilities,management systems,procurement,product development,professional development,project delivery,project management,quality assurance,quality control,quality management,quality systems,resource management,risk assessment,risk management framework,safety,safety management,safety management systems,strategic management,strategic planning,strategy,successful delivery,supply chain,sustainability,sustainability performance,sustainable development,transformation programmes

Discussion

We have seen that the rich set of data that can be collected from the open web can provide indicators of high productivity companies. This includes data collected from the organisation’s website, social media or online news services. Each of these contains positive indicators of productivity. In particular, we can look at the key topics that the organisations talk about as indicators of productivity. Export and Innovation appear to be high productivity indicators, and clusters around fleet operations, brand protection, corporate governance, energy efficiency and customer service also seem to correlate with high productivity. Although correlation is not an indicator of causality, it does provide lots of clues for further investigation.

We would like to thank the WYCA team for inviting us to work on this project, in particular, James Hopton and Alex Clarke for interesting discussions on the drivers of productivity.
