Problematic training data: When AI is taught biased data

Image credit: iStock

Artificial intelligence systems are sometimes trained on subjective data that can affect how they act and make decisions.
    • Author: Quantumrun Foresight
    • October 14, 2022

    Insight summary



    We are what we learn and internalize; this dictum also applies to artificial intelligence (AI). Machine learning (ML) models fed with incomplete, biased, and unethical data will ultimately make problematic decisions and suggestions. These powerful algorithms may then influence users' morality and perceptions if researchers aren't careful.



    Problematic training data context



    Since the 2010s, research teams have faced scrutiny for using training datasets that contain unsuitable content or were gathered unethically. For example, Microsoft's MS-Celeb-1M database, released in 2016, included 10 million images of 100,000 different celebrities. However, upon closer inspection, journalists discovered that many photos were of ordinary people pulled from various websites without their consent or knowledge.



    Despite this discovery, the dataset continued to be used by major companies such as Facebook and SenseTime, a Chinese facial recognition company with links to the state police. Similarly, a dataset of pictures of people walking on Duke University's campus (DukeMTMC) was also collected without consent. Eventually, both datasets were removed.



    To highlight the damaging effects of problematic training data, researchers at the Massachusetts Institute of Technology (MIT) created an AI called Norman and taught it to perform image captioning using data from a subreddit dedicated to graphic violence. The team then compared Norman against a neural network trained on conventional data. The researchers supplied both systems with Rorschach inkblots and asked the AIs to describe what they saw. The results were striking: where the standard neural network saw "a black and white photo of a baseball glove," Norman saw "a man murdered by machine gun in broad daylight." The experiment demonstrated that AI is not inherently biased, but that the data it is trained on, and the motives of the people who assemble that data, can significantly shape its behavior.



    Disruptive impact



    In 2021, the research organization Allen Institute for AI created Ask Delphi, an ML program that algorithmically generates an answer to any ethical question a user poses. The researchers behind the project argued that because AI is gradually becoming more powerful and ubiquitous, scientists need to teach these ML systems ethics. Delphi is built on the Unicorn ML model, which was designed to carry out "common sense" reasoning, such as selecting the most probable ending to a text string.
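    To illustrate what "selecting the most probable ending" means in practice, the sketch below scores two candidate endings for a sentence by their likelihood under a language model and picks the higher-scoring one. It is only a minimal illustration: it uses the publicly available GPT-2 model from the Hugging Face Transformers library as a stand-in, not the actual Unicorn or Delphi systems, and the prompt and candidate endings are invented for the example.

```python
# Minimal sketch: pick the more probable ending to a text string.
# Uses GPT-2 via Hugging Face Transformers as a stand-in model,
# NOT the actual Unicorn/Delphi system described above.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "She dropped the glass on the tile floor, so it"
endings = [" shattered into pieces.", " floated up to the ceiling."]

def avg_log_likelihood(text: str) -> float:
    """Average per-token log-likelihood of the full text under the model."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return -out.loss.item()  # loss is the mean negative log-likelihood

scores = {ending: avg_log_likelihood(prompt + ending) for ending in endings}
best = max(scores, key=scores.get)
print(scores)
print("Most probable ending:", best)  # expected: " shattered into pieces."
```

    Which ending scores highest depends entirely on what the model absorbed during training, which is why the makeup of the training data matters so much for systems like Delphi.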



    The researchers also trained Delphi on the 'Commonsense Norm Bank,' a collection of 1.7 million examples of people's ethical judgments drawn from sources like Reddit. The resulting output was a mixed bag: Delphi answered some questions reasonably (e.g., affirming equality between men and women), while on other topics it was downright offensive (e.g., suggesting that genocide is acceptable as long as it makes people happy).



    However, the Delphi AI learns from its experiences and appears to update its answers based on feedback. Some experts are troubled by the project's open public release, given that the model is still a work in progress and prone to erratic answers. When Ask Delphi debuted, Mar Hicks, a professor of history at Illinois Tech specializing in gender, labor, and the history of computing, said it was negligent of the researchers to invite people to use it, since Delphi immediately provided extremely unethical answers along with some complete nonsense.



    In 2023, Rest of World conducted a study on bias in AI image generators. Using Midjourney, researchers discovered that the generated images reinforce existing stereotypes. In addition, when OpenAI applied filters to the training data for its DALL-E 2 image generation model, it unintentionally intensified gender-related biases.



    Implications of problematic training data



    Wider implications of problematic training data may include: 




    • Reinforced biases in research projects, services, and program development. Problematic training data is particularly concerning if used by law enforcement and banking institutions (e.g., adversely targeting minority groups).

    • Increased investment in the development of larger and more diverse training datasets.

    • More governments increasing regulations to limit how corporations develop, sell, and use training data for various commercial initiatives.

    • More businesses establishing ethics departments to ensure that projects powered by AI systems follow ethical guidelines.

    • Enhanced scrutiny of the use of AI in healthcare, leading to stricter data governance that ensures patient privacy and ethical AI application.

    • Increased public and private sector collaboration to foster AI literacy, equipping the workforce with skills for an AI-dominated future.

    • Rise in demand for AI transparency tools, leading companies to prioritize explainability in AI systems for consumer understanding and trust.



    Questions to consider




    • How might organizations avoid using problematic training data?

    • What are other potential consequences of unethical training data?


    Insight references

    The following popular and institutional links were referenced for this insight: