Vokenization: Language that AI can see

With images now being incorporated into artificial intelligence (AI) systems training, robots might soon be able to “see” commands.
    • Author: Quantumrun Foresight
    • May 9, 2023

    Natural language processing (NLP) has enabled artificial intelligence (AI) systems to learn human speech by understanding words and matching context with sentiment. The only downside is that these NLP systems are purely text-based. Vokenization is about to change all that.

    Vokenization context

    Two text-based machine learning (ML) models are often used to train AI to process and understand human language: OpenAI’s Generative Pre-trained Transformer 3 (GPT-3) and Google's BERT (Bidirectional Encoder Representations from Transformers). In AI terminology, the units of text used in NLP training are called tokens. Researchers from the University of North Carolina (UNC) observed that text-based training programs are limited because they cannot "see," meaning they cannot capture visual information and communication.
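The idea of a token can be illustrated with a minimal sketch. This toy example splits text on whitespace and assigns each word an integer id; it is an assumption for illustration only, since production models like BERT and GPT-3 use learned subword vocabularies rather than whole words.

```python
# Toy illustration of "tokens": mapping words to integer ids.
# (Assumption: a simple whitespace tokenizer; real systems such as
# BERT use learned subword vocabularies instead.)

def build_vocab(corpus):
    """Assign each distinct lowercase word an integer id, in order seen."""
    vocab = {}
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def tokenize(sentence, vocab):
    """Convert a sentence into its sequence of token ids."""
    return [vocab[w] for w in sentence.lower().split()]

vocab = build_vocab(["the sheep eats grass", "the grass is green"])
print(tokenize("the sheep eats grass", vocab))  # → [0, 1, 2, 3]
```

A model trained only on such ids never observes what a "sheep" looks like, which is exactly the gap the UNC researchers point to.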

    For example, if someone asks GPT-3 what color sheep are, the system will often answer "black," even though sheep are typically white. This response occurs because the text-based system associates the word with the phrase "black sheep" rather than with the animal's actual color. By pairing tokens with relevant visuals (vokens), AI systems can develop a more holistic understanding of terms. Vokenization integrates vokens into self-supervised NLP systems, allowing them to develop "common sense."

    Integrating language models and computer vision is not a new concept, and it is a rapidly expanding field in AI research. The combination of these two types of AI leverages their individual strengths. Language models like GPT-3 are trained through unsupervised learning, which allows them to scale easily. In contrast, image models like object recognition systems can directly learn from reality and do not rely on the abstraction provided by the text. For example, image models can recognize that a sheep is white by looking at a picture.

    Disruptive impact

    The process of vokenization is pretty straightforward. Vokens are created by assigning corresponding or relevant images to language tokens. An algorithm (a vokenizer) is then trained to generate vokens through self-supervised learning (no explicit labels or rules). Common sense AI trained through vokenization can communicate and solve problems better because it has a more in-depth understanding of context. This approach is unique because it predicts not only language tokens but also image tokens, which is something traditional BERT models are unable to do.
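The token-to-image assignment described above can be sketched as a nearest-neighbor lookup in a shared embedding space. This is a minimal sketch with hand-made vectors and hypothetical image names; an actual vokenizer learns contextual token embeddings and image embeddings from data rather than using fixed ones.

```python
# Toy vokenizer sketch: assign each token the most similar image
# (its "voken") by cosine similarity.
# (Assumption: embeddings and image names below are invented for
# illustration; real vokenizers learn them from large corpora.)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

# Hypothetical contextual token embeddings.
token_embeddings = {
    "sheep": [0.9, 0.1, 0.0],
    "grass": [0.1, 0.9, 0.2],
}

# Hypothetical image embeddings for a small image inventory.
image_embeddings = {
    "photo_white_sheep.jpg": [0.85, 0.15, 0.05],
    "photo_meadow.jpg": [0.2, 0.8, 0.1],
}

def vokenize(tokens):
    """Map each token to the image whose embedding is most similar."""
    return {
        tok: max(
            image_embeddings,
            key=lambda img: cosine(token_embeddings[tok], image_embeddings[img]),
        )
        for tok in tokens
    }

print(vokenize(["sheep", "grass"]))
# → {'sheep': 'photo_white_sheep.jpg', 'grass': 'photo_meadow.jpg'}
```

Once every token carries a voken, the language model can be trained to predict both, which is the extra supervision signal the paragraph above describes.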

    For example, robotic assistants will be able to recognize images and navigate processes better because they can “see” what is required of them. Artificial intelligence systems trained to write content will be able to craft articles that sound more human, with ideas that flow better, instead of producing disjointed sentences. Considering the wide reach of NLP applications, vokenization can lead to better-performing chatbots, virtual assistants, online medical diagnoses, digital translators, and more.

    Additionally, the combination of vision and language learning is gaining popularity in medical imaging applications, specifically for automated medical image diagnosis. For example, some researchers are experimenting with this approach on radiograph images with accompanying text descriptions, where semantic segmentation can be time-consuming. The vokenization technique could enhance these representations and improve automated medical imaging by utilizing the text information.

    Applications for vokenization

    Some applications for vokenization may include:

    • Intuitive chatbots that can process screenshots, pictures, and website content. Customer support chatbots, in particular, may be able to accurately recommend products and services.
    • Digital translators that can process images and videos and provide an accurate translation that considers cultural and situational context.
    • Social media bot scanners that can conduct a more holistic sentiment analysis by merging images, captions, and comments. This application can be useful in content moderation that requires the analysis of harmful images.
    • Increasing employment opportunities for computer vision and NLP machine learning engineers and data scientists.
    • Startups building on these AI systems to commercialize them or provide customized solutions for businesses.

    Questions to comment on

    • How else do you think vokenization will change how we interact with robots?
    • How can vokenization change how we conduct business and interact with our gadgets (smartphones and smart appliances)?
