Quantumrun

Web-scale content analysis: Making sense of online content

Web-scale content analysis can help scan and monitor the vast volumes of information on the Internet, for example to identify hate speech.

Author:
Quantumrun Foresight
November 7, 2023

Insight summary

Machine learning and AI are revolutionizing the way we analyze vast amounts of online content. Web-scale content analysis, a more extensive form of traditional content analysis, employs techniques like natural language processing (NLP) and social network analysis (SNA) to categorize and understand internet data. This not only helps in flagging harmful content like hate speech but also provides valuable insights into financial crimes, reducing analysis time significantly. However, the technology also raises concerns about the spread of deepfake content and propaganda. As it evolves, it has broader implications, including improved language translation, bias detection, and enhanced cybersecurity measures.

Web-scale content analysis context

Web-scale content analysis is a larger-scale version of content analysis. Content analysis studies the linguistic elements of communications, especially their structural characteristics (e.g., message length, distribution of particular text or image components) and their semantic themes or meaning. The goal is to reveal patterns and trends that can help AI better categorize the information and assign value to it. Web-scale content analysis uses AI/ML to automate this process through natural language processing (NLP) and social network analysis (SNA).

NLP is used to understand the text on websites, while SNA is used to determine the relationships between these sites, mainly through hyperlinks. Together, these methods can help identify hate speech on social media and study academic quality and community formation through online posts, comments, and interactions. In particular, NLP can break text down into individual words and analyze each of them. It can also identify specific keywords or phrases within a website’s content, determine how often certain words are used, and gauge whether they appear in a positive or negative context.
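The steps described above can be illustrated with a minimal Python sketch that uses only the standard library. It is a toy illustration under stated assumptions, not how any production system works: the `POSITIVE` and `NEGATIVE` word lists, the function names, and the `.example` hostnames are all hypothetical, and a real pipeline would rely on trained language models and a web crawler rather than hand-written lexicons and hard-coded pages.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical toy lexicons; real systems use trained sentiment models.
POSITIVE = {"good", "great", "helpful"}
NEGATIVE = {"bad", "awful", "harmful"}


def tokenize(text):
    """Lowercase the text and split it into individual word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Count how often each word appears in the text."""
    return Counter(tokenize(text))


def keyword_polarity(text, keyword, window=3):
    """Score a keyword by the lexicon words found within `window` tokens of it."""
    tokens = tokenize(text)
    score = 0
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        nearby = tokens[max(0, i - window): i + window + 1]
        score += sum(t in POSITIVE for t in nearby)
        score -= sum(t in NEGATIVE for t in nearby)
    return score


class LinkCollector(HTMLParser):
    """Collect the hostnames of absolute links found in <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.hosts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            host = urlparse(href).netloc if href else ""
            if host:  # relative links have no host and are skipped
                self.hosts.append(host)


def link_graph(pages):
    """Map each site to the set of other hosts it links to."""
    graph = {}
    for site, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        graph[site] = {h for h in collector.hosts if h != site}
    return graph


def in_degree(graph):
    """Count how many sites link to each host (a basic SNA measure)."""
    return Counter(host for targets in graph.values() for host in targets)
```

Here, `word_frequencies` and `keyword_polarity` stand in for the NLP side (word counts and positive or negative context), while `link_graph` and `in_degree` stand in for the SNA side by treating hyperlinks between sites as relationships.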

Disruptive impact

Some scholars argue that because web content is increasing exponentially and becoming more unorganized and uncontrolled, there has to be a standardized method for algorithms to index and make sense of all this information. While automated content analyses through coding have been around for decades, they mostly follow an outdated protocol: simply counting word frequencies and processing text files. Deep learning and NLP can do much more by training AI to understand the context and motive behind messages. In fact, NLP has gotten so good at word analysis and categorization that it has given rise to virtual writing assistants that can mimic how humans organize words and sentences. Unfortunately, the same breakthrough is now being used to write deepfake content, such as articles and posts designed to promote propaganda and misinformation.

Nonetheless, web-scale content analysis is getting good at flagging hateful and violent speech and at identifying bad actors in social networks. All social media platforms rely on some content review system that can pinpoint those who promote illegal activities or cyberbullying. Aside from content moderation, web-scale analysis can create training data to help algorithms identify financial crimes, such as money laundering, tax evasion, and terrorist financing. In 2021, AI reduced the time it takes to analyze financial crimes from 20 weeks (equivalent to one human analyst) to 2 weeks, according to consultancy firm FTI.

Implications of web-scale content analysis

Wider implications of web-scale content analysis may include:

Advancements in language translation technologies because of AI’s extensive database of words and their culture-based meanings.

Tools that can detect and evaluate diversity and biases in speech and other content types. This feature can be useful in assessing the authenticity of op-eds and articles.