Tracking hate online: a data and technical challenge

glass.ai
7 min read · Dec 11, 2019


The UK government proposed in April to create a new regulator to control “Big Tech”. The proposal is part of growing momentum among governments around the world, many of which are now considering how to introduce similar regulation. In particular, governments are concerned about the spread of extremist and hate-related content.

glass.ai reads and understands text content from the open web, looking for meaningful content that can drive insights in economic and social research. What can this tell us about the scale of the data and technical challenge facing regulators who are trying to track harmful content online?

Size of the Challenge

To estimate the volume of data that would need to be monitored if harmful content is to be tracked, regulators might consider three categories: the Open Web, News and Blogs, and Closed Networks (e.g. Facebook). Outside the scope of our analysis are the Deep Web (unindexed web content hidden behind forms, search boxes and paywalls) and the Dark Web (intentionally hidden web content that uses the Internet infrastructure but requires alternative methods to access).

1) Open Web

Open Web content is accessible to all and is indexed by web search engines. Based on reading half of the web domains registered online, glass.ai estimates that around 100M web domains are active on the open web globally. Reading a sample of these sites on a regular basis, we estimate that they have published around 8B separate web pages. About 5% of that content is new each month (changes or additions): that is 400M pages a month that regulators would potentially need to track for harmful content.
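The arithmetic behind that monthly figure is simple enough to sketch in a few lines of Python, where every constant is one of the estimates above rather than a measurement:

```python
# Back-of-envelope arithmetic for the open-web figures above
# (all constants are this article's estimates, not measurements).
active_domains = 100e6   # ~100M active web domains globally
pages_total = 8e9        # ~8B published web pages
monthly_churn = 0.05     # ~5% of content new or changed each month

print(f"avg pages per domain: {pages_total / active_domains:.0f}")          # -> 80
print(f"pages to track monthly: {pages_total * monthly_churn / 1e6:.0f}M")  # -> 400M
```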

2) News and Blogs

For the purposes of the above analysis, we excluded news and blog content, treating it separately from the rest of the Open Web.

glass.ai has analysed a large sample of around 5,000 of the top Google news sources; beyond this set, news articles are published only infrequently. Within the top sources, glass.ai estimates that the largest, such as The New York Times, BBC, Huffington Post, and BuzzFeed, each publish a couple of hundred new stories a day, while the remaining sites average fewer than 10. In total, glass.ai estimates 125K new news stories are published worldwide each day, or approximately 3.75 million each month globally.

In addition to these news sources, new articles are also published on personal and organisation blogs. WordPress is the largest blogging platform, with around 60% market share. Currently, around 2.4M posts are published across WordPress sites each day; extrapolating to the rest of the market, there are potentially 4M new posts published each day, or well over 100 million new blog posts each month.
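The monthly volumes follow mechanically from the daily estimates above; a quick sketch, assuming a 30-day month:

```python
# Monthly news and blog volumes from the daily figures above.
news_per_day = 125_000            # estimated new news stories/day worldwide
wordpress_posts_per_day = 2.4e6   # posts/day across WordPress sites
wordpress_share = 0.60            # WordPress's share of the blogging market

news_per_month = news_per_day * 30                         # -> 3.75M stories/month
blogs_per_day = wordpress_posts_per_day / wordpress_share  # -> ~4M posts/day
blogs_per_month = blogs_per_day * 30                       # -> ~120M posts/month

print(f"news: {news_per_month / 1e6:.2f}M/month, blogs: {blogs_per_month / 1e6:.0f}M/month")
```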

3) Closed Networks

Closed network content is often hidden behind user accounts; the largest examples are the major social networks. Although the content is mostly hidden from the public internet, what is visible on the public web can give clues to the scale of what sits behind the logins.

glass.ai technology is targeted at the Open Web, so it excludes platforms built around user accounts and user-generated content. To examine what lies outside the content read by glass.ai, we used Google “site:” searches to see how much content Google has indexed from key social media and consumer content sites.

  • Facebook has around 1B pages in the Google index for facebook.com and 181M for instagram.com. These will include public accounts on Facebook products and other public-facing content. An estimated 20% of Facebook accounts are public, which leads us to conclude that Facebook contains at least 5B pages in its primarily closed network, and probably many more. The figures cross-check from the other direction: Facebook currently claims over 2.5B users, which at 20% public suggests 500M public profiles; if those 500M profiles account for the 1B indexed pages, then 2.5B accounts would generate a minimum of 5B pages, alongside constantly changing timelines and news feeds full of new status updates, comments and shares. Facebook itself claimed 4.75B pieces of content shared per day back in 2013. (The extrapolation is sketched in code after this list.)
  • Twitter, by contrast, has mostly open content, although you need to be within the network to follow accounts and access its full functionality. There are 383M pages indexed on Google, which ties in with Twitter’s recent claims of 320M active users. These pages are constantly updated by a stream of over 500M tweets per day.
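The Facebook extrapolation referenced above can be made explicit. A minimal sketch, using only the rough figures quoted in the bullet:

```python
# Extrapolating closed-network size from publicly indexed pages.
# All inputs are the rough figures quoted in the bullet above.
indexed_facebook_pages = 1e9   # pages Google has indexed for facebook.com
public_account_share = 0.20    # estimated share of Facebook accounts that are public
facebook_users = 2.5e9         # Facebook's claimed user count

# If 20% of accounts produce 1B public pages, 100% imply at least 5B.
min_total_pages = indexed_facebook_pages / public_account_share

# Cross-check: public profiles implied by the claimed user count.
public_profiles = facebook_users * public_account_share              # -> ~500M
pages_per_public_profile = indexed_facebook_pages / public_profiles  # -> ~2

print(f"min total pages: {min_total_pages / 1e9:.0f}B "
      f"({pages_per_public_profile:.0f} pages per profile)")
```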

The Tech Giants

We have already seen that Facebook alone appears to control over 5B pages online. If we take a similar look at the other web behemoth, Google, we can get an idea of how much of the internet the two companies control between them.

  • The YouTube video sharing platform owned by Google has over 3B pages indexed in the Google search engine.
  • The Blogger blogging platform is also owned by Google and has 750M results for blogspot.com in the Google index.

In fact, the extent of Google is so large that a recent study found that 12% of all clickthroughs from Google searches landed on another Google-owned site. That is if a clickthrough occurs at all: the same study found that as Google has increased the visibility of structured results (e.g. cinema listings, sports results, Wikipedia extracts, answer panels, travel, and much more), the share of searches where users leave the site by following a result has dropped to 49%.

If we add these to the total from Facebook, this small sample alone shows the major tech players controlling more online content, around 9B pages, than is estimated to exist across the rest of the open web (the 8B pages mentioned earlier). In other words, nearly 53% of what we find on the open and closed web belongs to Facebook or Google properties alone.
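For transparency, the totals behind that claim, summed from the per-property figures above (9B is the rounded sum):

```python
# Summing the pages attributed to Facebook and Google properties above.
facebook_pages = 5e9    # minimum estimate derived earlier
youtube_pages = 3e9     # pages indexed for youtube.com
blogger_pages = 0.75e9  # results for blogspot.com

big_tech_pages = facebook_pages + youtube_pages + blogger_pages  # -> 8.75B, call it 9B
rest_of_open_web = 8e9

share = big_tech_pages / (big_tech_pages + rest_of_open_web)
print(f"{share:.0%}")  # -> 52%; rounding the sum up to 9B gives ~53%
```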

Tracking Hate

Let’s imagine for a moment that a regulator has access to all this data (open web, news and blogs, closed networks). Even if regulators had the capacity to monitor this volume of content and keep up with its ever-changing, ever-growing landscape, there would still be the technical challenge of identifying the hate content, ideally in real time.

hatebase.org holds a large repository of hate-related words and terms. glass.ai took a small sample of these to see how the share of hate terms varied across different sources.
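The measurement itself is conceptually simple. A minimal sketch of this kind of lexicon matching, with placeholder terms and documents standing in for the real hatebase.org sample:

```python
import re

# Placeholder terms standing in for real hatebase.org entries.
hate_terms = ["term_a", "term_b"]
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, hate_terms)) + r")\b", re.IGNORECASE
)

def term_share(documents: list[str]) -> float:
    """Share of documents containing at least one lexicon term."""
    hits = sum(1 for doc in documents if pattern.search(doc))
    return hits / len(documents) if documents else 0.0

# Compare the share across sources (placeholder documents shown here).
sources = {"open_web": ["..."], "news": ["..."], "social_profiles": ["..."]}
for name, docs in sources.items():
    print(name, term_share(docs))
```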

We discovered that only a small fraction of the hate words appeared on the open profiles of the main social networks. We can expect this to increase inside the networks, perhaps five-fold based on the ratio of public to private accounts, although it could be higher if people are less inhibited in what they say within a closed network. That said, the relatively low numbers may also reflect the fact that these networks are already making efforts, without legislation, to clean up content, and already employ armies of people to manually monitor and review it.

But a bigger problem may exist outside the closed networks, where content is spread across disparate owners and there is often less focus on dealing with harmful material. A knock-on effect is that these open sources are often what is being shared within the social networks: content is written outside the network, then shared and spread within social media. Stopping the spread does not remove the source, which makes it easy for the content to reappear on social media very quickly.

In terms of the technical challenge itself, the sample highlighted that not all occurrences of the hate terms were in the context of harmful messages. A fuller understanding of the content is needed to make a decision on the nature of what is being said. In several cases, a phrase that started off in a harmful context had also found its way into general use elsewhere (e.g. “yellow bone” has gained popularity among young black people to the extent that it appears regularly in hashtags on Twitter and Instagram as a positive description of light-skinned black people, while others view it as negative against a background of racial discrimination). Keeping up with the changing landscape of language requires understanding the context in which phrases are used and more fully comprehending what is being said.
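One way to picture the technical requirement is as a two-stage pipeline: cheap term matching to surface candidate passages, then a context-aware model to decide whether a given usage is actually harmful. A minimal sketch, where classify_context is a hypothetical stand-in for a trained classifier rather than a real library call:

```python
import re

# Terms whose meaning depends entirely on usage, like the example above.
AMBIGUOUS_TERMS = {"yellow bone"}

def classify_context(passage: str) -> str:
    """Hypothetical stand-in for a trained context classifier (e.g. a
    fine-tuned text model) returning 'harmful' or 'benign'."""
    # Placeholder logic only; a real system would use an ML model here.
    return "benign"

def flag_passage(passage: str) -> bool:
    """Stage 1: cheap keyword hit. Stage 2: context decides the label."""
    for term in AMBIGUOUS_TERMS:
        if re.search(rf"\b{re.escape(term)}\b", passage, re.IGNORECASE):
            # A bare keyword hit is only a candidate, not a verdict.
            return classify_context(passage) == "harmful"
    return False
```

The second stage is precisely what the “yellow bone” example demands: the same string can be a positive hashtag or a slur, and only the surrounding context distinguishes the two.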

The Challenge Ahead

We can see there is a significant challenge ahead for regulators. There is a huge amount of content published online, 18B+ pages, and it is constantly changing and growing, perhaps by as much as a third per day. As well as the challenge of scale, there is a problem of ownership: some of the social media giants are already taking steps to remove harmful content, but as we have seen, the problem could be even larger out on the open web, where ownership is a much more complex issue. Finally, there is the technical challenge: simply looking for hate terms and phrases will lead to a lot of false positives, so more sophisticated methods of reading and understanding the context of what is being said or shown are also needed.


glass.ai

AI research capability that deep reads the web to understand sectors and companies. Used by governments, consultancies, corporates and universities.