
From Text to Intelligence: A Comprehensive Analysis of Text Annotation (with 2024 Trend Insights)

"If NLP is the crown jewel of AI, text annotation is the polishing wheel." – Lin Du, CEO of BasicAI.


Chatbots now handle personalized refunds. Mobile input methods predict what users are about to type. We jokingly call the prompts we give large language model (LLM) agents "spells", a nod to how awe-inspiring machine intelligence has become. Each of these builds on fields pioneered by humans, such as computer vision, machine learning, and, most importantly, natural language processing (NLP). Machines are learning how humans converse, express themselves, comprehend, respond, and analyze, and can even mimic human dialogue and emotionally driven behavior. This has significantly impacted chatbots, text-to-speech, speech recognition, virtual assistants, and more.

Text annotation is vital in NLP development. In this article, we will explore various aspects of text annotation and uncover the story behind the interaction between AI and humans.

From NLP to Text Annotation

Artificial Intelligence (AI) is often divided into computation, perception, cognition, and creation levels. The progression from computation to perception to cognition is a recognized AI development path. Cognitive intelligence, enabling machines to think like humans, is the current focus. This involves machines understanding data, language, and the real world, interpreting data, processes, and phenomena with human-like reasoning and planning. Cognitive intelligence must solve complex tasks like reasoning, planning, association, and creation.

Natural language, an interactive information source, has long been a major challenge for cognitive AI. Text is the most common data type, and vast amounts of it have been created since the start of the 21st century. The text we encounter daily can be general or professional. General text is often casual, with messy grammar and insufficient information. Professional text requires industry knowledge, follows its own grammar, and is highly personalized. Language complexity and semantic differences across languages make automatic, intelligent language understanding an industry pain point. Natural Language Processing (NLP) emerged in response: it extracts, induces, and summarizes information from human language, and it has become a key criterion for evaluating AI.

Visualization of Knowledge Graph. Source: https://dongshengwang.medium.com/5-reasons-knowledge-graph-will-never-bloom-418601957f33

"If NLP is the crown jewel of AI, text annotation is the polishing wheel."

Text annotation attaches tags to text that mark its features: semantics, composition, context, purpose, emotion, and more. These tags help machines recognize human intentions and emotions so that they can understand language accurately. Supervised and semi-supervised machine learning requires well-annotated training data. While public corpora exist, building a professional corpus and annotating it often yields better training results for vertical-domain models. For example, annotating electronic medical records and literature supports training high-precision medical assistants, and annotating court decisions and legal provisions supports legal consultation and case retrieval applications.

Types of Text Annotation Tasks

To understand text annotation types, let's examine the questions natural language understanding poses for text-based algorithms. From an NLP perspective, there are several main basic task categories:

  • Sequence Labeling: Includes part-of-speech tagging, word segmentation, named entity recognition, and semantic role labeling.

  • Classification and Clustering: Includes text classification, topic clustering, and sentiment classification.

  • Relationship Matching: Encompasses text matching, semantic similarity, and syntactic and logical judgments.

  • Text Generation: Includes text summarization, machine translation, and poetry and sentence generation.

Other basic tasks include graph computation and anomaly detection. Some semantic and pragmatic tasks combine the above basic tasks. Combining multiple algorithmic tasks or integrating with knowledge graphs can further solve these problems.

From this, we can infer the types of data text annotation should produce and summarize the following annotation task types:

Text Classification

Text classification, a fundamental NLP task, infers the label(s) for a given text (sentence, document, etc.). Classification tasks analyze text content to categorize texts by granularity, including topic classification, topic clustering, sentiment classification, and other forms. They can also filter user information and mine text based on hierarchical divisions, more accurately screening text content. For example, patent texts can be classified into different broad and narrow categories.

In text classification, annotators read many paragraphs or sentences, understand the emotions, sentiments, and intentions behind them, and classify the text into project-designated categories based on their understanding. This can be as simple as categorizing an article section as entertainment or sports or as complex as classifying products in an e-commerce store.

Examples:

"Climate change is a pressing global issue that demands urgent action from governments and individuals alike. Rising sea levels, more frequent and intense natural disasters, and the loss of biodiversity are just a few of the cascading consequences that we face if we fail to address this crisis. It is imperative that we reduce greenhouse gas emissions, invest in renewable energy, and adopt sustainable practices across all sectors of society."
💡 This passage emphasizes climate change's severity and urgency, detailing potential impacts and calling for comprehensive, multi-faceted responses. An annotator might classify this text into "environment" or "politics" categories and flag it as a high-priority issue.
"The thrilling match between the two rivals kept fans on the edge of their seats until the final whistle. The intensity on the field was palpable, with both teams displaying impressive skills and unwavering determination. The game served as a testament to the enduring spirit of sportsmanship and the ability of athletics to unite people from all walks of life."
💡 This paragraph describes an exciting sporting event, focusing on the game's intensity and audience engagement. An annotator would classify this text into the "sports" category, possibly with a subcategory for the specific sport. They might also note the strong emotional undertones and broader social commentary.
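To make the two examples above concrete, here is a minimal sketch of how a classification result might be stored. The `ClassifiedText` record, the `ALLOWED_LABELS` set, and the `high_priority` flag are hypothetical names chosen for illustration; a real project would define its own categories in the annotation guidelines.

```python
from dataclasses import dataclass, field

# Hypothetical label set; a real project defines its own categories.
ALLOWED_LABELS = {"environment", "politics", "sports", "entertainment"}


@dataclass
class ClassifiedText:
    text: str
    labels: list[str]  # one or more project-designated categories
    flags: list[str] = field(default_factory=list)  # optional notes, e.g. "high_priority"

    def __post_init__(self) -> None:
        # Reject labels that are not part of the agreed label set.
        unknown = set(self.labels) - ALLOWED_LABELS
        if unknown:
            raise ValueError(f"labels not in the project label set: {sorted(unknown)}")


climate = ClassifiedText(
    text="Climate change is a pressing global issue ...",  # excerpt of the first example
    labels=["environment", "politics"],
    flags=["high_priority"],
)
match_report = ClassifiedText(
    text="The thrilling match between the two rivals ...",  # excerpt of the second example
    labels=["sports"],
)
```

Validating labels at creation time is one simple way to keep annotators within the project-designated categories.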

Entity Annotation

An entity is something with a concrete, real form or structure, generally referring to nouns with specific meanings or strong references in the text, including person names, place names, organization names, dates and times, proper nouns, etc. Entity annotation requires extracting entities from a sentence, such as television, football, door, etc. Sometimes it is also necessary to categorize the sentence, such as music, encyclopedia, news, etc., or to annotate action instructions in the text (open the door, play, etc.). Many companies apply named entity annotation in various application scenarios.

Annotators typically face three types of entity annotation tasks: keyword annotation, named entity recognition, and part-of-speech tagging. Most named entities are proper nouns, new words, or terms that do not appear in a standard vocabulary ("unregistered words"). Named entity recognition (NER) extracts entities of specific categories from unstructured text, such as product (PRODUCT), attribute (PROPERTY), and component (COMPONENT) names in patent texts. A named entity label consists of two parts: a position indicator and an entity category. The position indicator records where the current character or token sits within the named entity, such as its beginning or end. The entity category distinguishes the type of named entity and is often abbreviated, such as PER for person names, LOC for place names, and ORG for organization names.

Examples:

"Microsoft, the tech giant founded by Bill Gates, has its headquarters in Redmond, Washington."
💡 The entities that may need annotation in this text are: "Microsoft" (organization), "Bill Gates" (person), and "Redmond, Washington" (location).
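The two-part labels described above (position indicator plus entity category) can be sketched with the common BIO scheme, where B- marks the first token of an entity, I- a later token, and O a token outside any entity. The choice of BIO, the simple regex tokenizer, and the helper names `bio_tags` and `span` are assumptions made for this illustration.

```python
import re


def bio_tags(text, entities):
    """Return (token, tag) pairs; entities are (start, end, category) character spans."""
    pairs = []
    # Tokenize into words and single punctuation marks, keeping character offsets.
    for m in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, category in entities:
            if start <= m.start() and m.end() <= end:
                # B- marks the first token of an entity, I- any later token.
                prefix = "B" if m.start() == start else "I"
                tag = f"{prefix}-{category}"
                break
        pairs.append((m.group(), tag))
    return pairs


sentence = ("Microsoft, the tech giant founded by Bill Gates, "
            "has its headquarters in Redmond, Washington.")


def span(phrase, category):
    """Locate a phrase in the sentence and return its character span."""
    start = sentence.index(phrase)
    return (start, start + len(phrase), category)


entities = [
    span("Microsoft", "ORG"),
    span("Bill Gates", "PER"),
    span("Redmond, Washington", "LOC"),
]
tagged = bio_tags(sentence, entities)
```

Here "Bill" receives B-PER and "Gates" receives I-PER, while the comma inside "Redmond, Washington" is tagged I-LOC because it falls within the entity span.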

Relationship Annotation

Relationship annotation formally marks the important syntactic and semantic associations within compound sentences so that those sentences can be analyzed automatically. It involves finding entities in a text and determining the actual relationship between them, such as the "colleague," "classmate," or "teacher-student" relationship between people.

Text data used for knowledge graph training relies on relationship annotation, also known as triple annotation. For example, from the sentence "The capital of the United States is Washington, D.C.," a triple can be extracted: "the United States," "capital," and "Washington, D.C." Here, "the United States" and "Washington, D.C." are two entities, and "capital" is the relationship word describing how they are related. Together they form a triple (node1, edge, node2).

Examples:

"The Eiffel Tower is located in Paris, France."
💡 This sentence contains three entities: "Eiffel Tower" (landmark), "Paris" (city), and "France" (country). Annotators can annotate the relationships as "Eiffel Tower" -> "located in" -> "Paris" and "Paris" -> "located in" -> "France."
"Apple Inc. was founded by Steve Jobs and Steve Wozniak in 1976."
💡 Annotators can annotate the relationships as "Steve Jobs" (person) -> "founded" -> "Apple Inc." (organization), "Steve Wozniak" (person) -> "founded" -> "Apple Inc." (organization).
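The (node1, edge, node2) triples from these examples can be stored and queried with a few lines of code. This is only a sketch: the helper names `build_graph` and `heads_of` are hypothetical, and a production knowledge graph would normally use a dedicated graph store.

```python
from collections import defaultdict

# Each triple is (node1, edge, node2), taken from the example sentences above.
triples = [
    ("Eiffel Tower", "located in", "Paris"),
    ("Steve Jobs", "founded", "Apple Inc."),
    ("Steve Wozniak", "founded", "Apple Inc."),
]


def build_graph(triples):
    """Group outgoing edges by head entity: {node1: [(edge, node2), ...]}."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return dict(graph)


def heads_of(triples, relation, tail):
    """Return every node1 that is linked to `tail` by `relation`, sorted by name."""
    return sorted(h for h, r, t in triples if r == relation and t == tail)


graph = build_graph(triples)
founders = heads_of(triples, "founded", "Apple Inc.")
```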

Sentiment Annotation

Human responses can sometimes be sarcastic, and machines may interpret mocking negative comments as praise. Therefore, it is still necessary for people to judge the true sentiment of speech, which involves text annotation. Such annotations usually require determining the sentiment contained in a sentence, such as three-level sentiment annotation (🔺positive, ➖neutral, 🔻negative), while more demanding tasks may have six or even twelve levels of sentiment annotation.

To obtain this data, human annotators need to evaluate emotions and comments on online platforms (including social media and e-commerce websites) and be able to mark and report abusive, sensitive keywords or new words. Accurate sentiment annotation is crucial for training machine learning models that classify text into various sentiments and helps gain a deeper understanding of user sentiment toward products or services.

Examples:

"I guess the food was alright, but honestly, I was expecting more given all the hype around this place."
💡 This review conveys disappointment and unmet expectations. The speaker uses "I guess" and "alright" to imply a lukewarm sentiment, while mentioning the "hype" suggests that the experience did not live up to its billing. An annotator might assign a mildly negative sentiment label, like 2/5 or "slightly dissatisfied," to reflect the speaker's reaction.
"The customer service representative was helpful, but I felt like they were just going through the motions without really caring about my issue."
💡 This sentence expresses a complex sentiment. While the speaker acknowledges that the representative was helpful, they also indicate that the interaction felt insincere or lacking in genuine concern. An annotator might assign a neutral or slightly negative sentiment label, like 3/5 or "mixed", to capture the conflicting emotions in the speaker's sentiment.
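When a project collects fine-grained scores such as 2/5 or 3/5 but the model needs three-level labels, the scores have to be mapped. The sketch below assumes a 1 to 5 scale and illustrative thresholds (1 and 2 are negative, 3 is neutral, 4 and 5 are positive); the function names and the use of the low median to combine several annotators are assumptions, not a standard.

```python
from statistics import median_low


def three_level(score: int) -> str:
    """Map a 1-5 sentiment score to a three-level label (thresholds are illustrative)."""
    if not 1 <= score <= 5:
        raise ValueError("score must be between 1 and 5")
    if score <= 2:
        return "negative"
    if score == 3:
        return "neutral"
    return "positive"


def consensus(scores: list[int]) -> str:
    """Combine several annotators' scores using the low median, then map the result."""
    return three_level(median_low(scores))


restaurant_review = consensus([2, 2, 3])   # three annotators rated the first example
service_review = three_level(3)            # one annotator rated the second example
```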

Intent Annotation

As people increasingly use human-machine interaction for communication, machines must be able to understand natural language and user intent. Intent annotation is an important pillar in the development of chatbots, virtual assistants, and intelligent search engines. Multi-intent data collection and classification can divide intents into several key categories, including requests, commands, reservations, recommendations, and confirmations. For example, customers may have clear intents to query the weather, with intents such as "query weather," "query weather – rain," "query weather – fog," and "query weather – temperature."

Examples:

"Could you please provide me with more information about your pricing plans?"
 ("pricing_inquiry" or "request_pricing_information")
"Can you tell me more about your return policy?" 
("return_policy_inquiry" or "request_return_information")
"I'm really frustrated with how long it took to receive my order." 
("delivery_complaint" or "express_dissatisfaction")
"I truly appreciate your patience and understanding throughout this process." 
("express_grateful" or "thank_for_support")
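Intent labels are often hierarchical, as in the "query weather – rain" example earlier in this section. A minimal sketch of turning such flat labels into a parent/child structure follows; the " – " separator matches the labels as written in this article, and the helper name `split_intent` is hypothetical.

```python
def split_intent(label: str, sep: str = " – "):
    """Split 'parent – child' into (parent, child); child is None for a top-level intent."""
    parent, _, child = label.partition(sep)
    return parent, child or None


labels = [
    "query weather",
    "query weather – rain",
    "query weather – fog",
    "query weather – temperature",
]

# Build {parent intent: [sub-intents]} so annotators can pick a parent first, then refine.
intent_tree = {}
for label in labels:
    parent, child = split_intent(label)
    intent_tree.setdefault(parent, [])
    if child is not None:
        intent_tree[parent].append(child)
```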

Text Annotation Process

Data annotation generally consists of several steps, including collection, cleaning, annotation, and quality control. Text data annotation is no exception, and the general process can be summarized as follows:

  1. Text data acquisition: Raw text data is obtained through various methods, such as offline collection or web scraping, depending on the project requirements.

  2. Data preprocessing: Collected data undergoes preprocessing to remove duplicates, invalid entries, and any redundant or useless information.

  3. Developing annotation guidelines: Clear, consistent, and easy-to-follow annotation principles and processes are defined through guidelines to clarify the problem and expectations, ensuring that annotators have a uniform understanding of the task.

  4. Data annotation: Suitable annotation tools are chosen, and the annotation is performed according to the specified requirements.

  5. Quality control: Manual or automatic reviews of the annotation task are conducted to prevent errors, biases, or inconsistencies. Multiple annotators can label the same samples to ensure consistency and reduce interpretation bias.

  6. Output of results: After data annotation, technical means are used to convert the annotated results into the format required by the algorithm, which is a crucial part of an AI trainer's job responsibilities.
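Steps 2 and 6 above can be sketched in a few lines. The example assumes that cleaning means normalizing whitespace and dropping empty or duplicate entries, and that the downstream algorithm expects JSON Lines; both are illustrative assumptions, as are the sample texts and the function names `preprocess` and `to_jsonl`.

```python
import json


def preprocess(raw_texts):
    """Normalize whitespace, drop empty entries, and remove exact duplicates (order preserved)."""
    seen = set()
    cleaned = []
    for text in raw_texts:
        text = " ".join(text.split())
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


def to_jsonl(records):
    """Serialize annotated records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


raw = ["Great product!", "  Great   product! ", "", "Arrived late."]  # made-up sample data
cleaned = preprocess(raw)

# In a real project the labels come from the annotation step; here they are hard-coded.
annotated = [{"text": t, "label": l} for t, l in zip(cleaned, ["positive", "negative"])]
output = to_jsonl(annotated)
```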

Real-life Challenges and Solutions in Text Annotation

Abstract and incomprehensible label definitions: Overemphasizing generalizability can lead to ambiguous definitions that are difficult for annotators to understand and apply. Provide specific examples and list typical scenarios covered by the labels to help annotators form a concrete understanding.

Ambiguous label boundaries and difficult-to-distinguish samples: Continuously collect and update confusing sample cases, organize regular discussions among the annotation team to clarify label boundaries, and strengthen communication between annotators and task initiators to promptly resolve judgment issues.

Difficulty in single-person simultaneous multi-label annotation: Control the number of labels annotated by a single person or adopt a method where each person is responsible for one label, transforming the multi-label task into a binary classification task.

Lack of effective communication and review within the annotation team: Establish a daily summary and review mechanism, encouraging annotators to share difficult samples and personal insights to unify the group's annotation scale. Invite task initiators to participate and provide guidance from a business perspective.

Tight annotation schedules and difficulty in ensuring quality: Fully evaluate the workload and difficulty of text annotation, allow reasonable time margins, and prioritize annotation quality over speed to avoid rework due to quality issues.

Annotators' unfamiliarity with the business: Organize business training in the early stages, write detailed annotation guidelines listing common cases, and maintain close communication with the business side to promptly resolve deviations in understanding.

Lack of effective quality control measures: Task initiators should spot-check annotated data daily, provide timely feedback on error samples, and periodically use lightweight models for quality inspection to objectively evaluate annotation quality and make necessary adjustments to the annotation scheme.
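The "one person per label" approach mentioned above turns a multi-label task into several yes/no tasks whose answers are merged afterwards. A minimal sketch, with hypothetical function names and made-up sample answers:

```python
def binary_tasks(texts, labels):
    """Create one yes/no task list per label, so each annotator handles a single label."""
    return {
        label: [{"text": t, "question": f"Does this text belong to '{label}'?"} for t in texts]
        for label in labels
    }


def merge_answers(answers):
    """answers maps label -> list of booleans (one per text); returns the labels per text."""
    n = len(next(iter(answers.values())))
    return [[label for label, votes in answers.items() if votes[i]] for i in range(n)]


texts = ["Parliament debates stadium funding.", "New tax bill passes."]
tasks = binary_tasks(texts, ["sports", "politics"])

# Made-up answers from two annotators, each responsible for one label.
merged = merge_answers({"sports": [True, False], "politics": [True, True]})
```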


Text Annotation Example on BasicAI Cloud Text Labeling Tool

High-quality text annotation requires joint efforts from the annotation team and business side, involving meticulous control and optimization in various aspects. Supplemented by necessary training and feedback mechanisms, satisfactory annotation results can be obtained, laying a solid foundation for subsequent model training and optimization.

Principles of Text Annotation

Through the above sections, we understand that high-quality text annotation is key to project success. To ensure annotation quality, the following principles should be followed:

Understanding linguistic fundamentals: Familiarity with branches of linguistics such as syntax, semantics, and discourse analysis helps determine data annotation methods and clarify the purpose of the text.

Adopting an iterative annotation process: The text annotation process requires multiple iterations between modeling and annotation to continuously optimize and establish a comprehensive annotation model.

Maintaining consistency of annotated data: Establish annotation standards, refine annotation methods, and use cross-annotation approaches to reduce differences between annotators.

Developing clear annotation rules: Establish clear annotation rules before formal annotation begins, such as single-label, multi-label, embedded, or separate annotation, and write an annotation manual for large projects.

Implementing a strict review system: Completed text annotations must undergo a rigorous review process, preferably performed by personnel involved in formulating the annotation rules, with reasonable resource allocation.

Clarifying the application purpose of the text: Understanding the purpose of the text, such as text classification or machine translation, helps establish appropriate annotation standards and models.

Text annotation is complex, and various problems may arise during implementation. If you lack sufficient experience, trial-annotate a small amount of data first and then gradually expand the annotation scale. Text annotation significantly affects the training results of machine learning models, and it is no less important than model construction and algorithm optimization. This step therefore deserves close attention in any AI project.
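One way to put a number on the consistency principle during a trial round is to have two annotators label the same samples and compute an agreement statistic. The sketch below uses Cohen's kappa, a widely used measure that this article does not prescribe; it is offered here as one reasonable choice, and the labels are made-up sample data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same samples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of samples")
    n = len(labels_a)
    # Observed agreement: share of samples where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohens_kappa(annotator_1, annotator_2)  # 0.5 for this sample
```

A value near 1 indicates strong agreement, while a value near 0 means the annotators agree no more often than chance, which usually signals that the guidelines need clarification.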

Text Annotation Tools

Open Source Options

Brat

Brat is an open-source text annotation tool widely used in academia. Its server is written in Python and is typically run on a Unix-like system such as Ubuntu, and annotators work in a web browser such as Google Chrome. Brat supports the annotation of entities, events, relationships, and attributes, and entity types can be defined per task in the annotation.conf file.
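Brat stores annotations in a tab-separated "standoff" file next to the text. The sketch below reads a small hand-written sample in that style; it assumes contiguous entity spans and binary relations only, and the `Organization`, `Person`, and `Founded-by` type names are hypothetical entries that a project would define in its own annotation.conf.

```python
def parse_standoff(ann_text):
    """Parse entity (T) and relation (R) lines from brat-style standoff text."""
    entities, relations = {}, []
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        ann_id, body, *rest = line.split("\t")
        if ann_id.startswith("T"):
            # Entity line: ID <tab> TYPE START END <tab> covered text
            etype, start, end = body.split()
            entities[ann_id] = {"type": etype, "start": int(start),
                                "end": int(end), "text": rest[0]}
        elif ann_id.startswith("R"):
            # Relation line: ID <tab> TYPE Arg1:Tn Arg2:Tm
            rtype, arg1, arg2 = body.split()
            relations.append((rtype, arg1.split(":")[1], arg2.split(":")[1]))
    return entities, relations


sentence = "Microsoft, the tech giant founded by Bill Gates, has its headquarters in Redmond, Washington."
sample_ann = (
    "T1\tOrganization 0 9\tMicrosoft\n"
    "T2\tPerson 37 47\tBill Gates\n"
    "R1\tFounded-by Arg1:T1 Arg2:T2\n"
)
entities, relations = parse_standoff(sample_ann)
```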

Brat Text Annotation Tool. Source: https://brat.nlplab.org/

Doccano

Doccano is an open-source text annotation tool built with Django as the web framework and a JavaScript front end developed with Node.js tooling. It provides annotation functions for text classification, sequence labeling, and sequence-to-sequence tasks, supporting the creation of labeled data for sentiment analysis, named entity recognition, and text summarization. Although Doccano does not currently support entity relationship and event annotation, it is simpler and more lightweight to operate than Brat.

Doccano Text Annotation Tool. Source: https://doccano.github.io/

Xtreme1

Xtreme1 is the world's first open-source multimodal training data platform. For text tasks, Xtreme1's highlight is its RLHF multi-turn dialogue annotation function oriented towards large models. Users can create classification labels such as "quality score", "humor level", and "helpfulness" for text data, or intuitively evaluate text data through "like" or "dislike" annotations. This facilitates more detailed evaluation and analysis of the data, providing beneficial reference for future model training. Users can also input replies through the "long text" function to form new data, helping expand the diversity of the dataset and improve the generalization ability of the model.
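Scores or like/dislike marks on replies are often converted into preference pairs before they are used for training. The sketch below shows one possible conversion; the record layout and the function name `preference_pairs` are hypothetical and do not represent Xtreme1's export format.

```python
from itertools import combinations


def preference_pairs(prompt, replies):
    """replies is a list of (text, score); return a chosen/rejected pair for each unequal pair."""
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(replies, 2):
        if score_a == score_b:
            continue  # a tie carries no preference signal
        chosen, rejected = (text_a, text_b) if score_a > score_b else (text_b, text_a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# Made-up replies with a hypothetical 1-5 "helpfulness" score from an annotator.
replies = [("Reply A", 5), ("Reply B", 2), ("Reply C", 5)]
pairs = preference_pairs("How do I reset my password?", replies)
```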

Xtreme1 RLHF Text Annotation Tool. Source: https://github.com/xtreme1-io/xtreme1

Free Cloud-based Text Annotation Tool, BasicAI Cloud

If you need a text annotation tool that is ready to use, BasicAI Cloud is a good choice. This platform supports full-category data annotation of text, images, videos, audio, and 3D sensor fusion. It includes an ontology center to manage multi-level classes and attributes, and its collaborative annotation system allows teams to annotate text datasets simultaneously. Free accounts can use all the platform's features. Here's a step-by-step guide on how to use BasicAI Cloud for text annotation:

Create a Dataset

Register an account on BasicAI Cloud, enter the homepage, switch to the "Dataset" page, and click "+ Create" to create a dataset. Select the "Text" type under "Dataset Type", name the dataset, and upload the data that needs to be annotated. Text data supports .txt, .csv, .xlsx, .xls, or compressed packages containing valid text data.


Text Annotation Tool on BasicAI Cloud - Step 1

Define Ontology (Multi-level Labels)

In this example, our text data is a short biography of Amelia Earhart. It requires text classification, entity annotation, and relationship annotation. Therefore, we need to define the ontology first.

In the "Ontology" tab, define the global classification label of the text under "Classification". Create entities and relationships under "Class & Relation", and define their subordinate attributes. Assign different colors to distinguish them.


Text Labeling Tool on BasicAI Cloud - Step 2

Text Annotation

With everything ready, we return to the "Data" tab to start annotating. Click "Annotate" floating above the data to enter the tool interface.

Text classification: The first task is to classify the text. Click the "Classification" tab on the far right to select the classification of the text, which is the "Text Type" classification label we created first.

Entity annotation: Next, click "Results" on the far right to enter text annotation. Select the words or sentences you need to annotate, just like you would select text in Word, and assign labels to them to complete entity annotation.

Relationship annotation: Switch to the "Relation" series, click on one entity, then click on another entity to create a connecting line. You can assign the relationship labels you defined earlier to this connecting line.

Data export: Once everything is done, click "Save" in the upper right corner to exit the tool interface, and click "Export" to create an export task. It can be exported to your cloud storage (Amazon S3, Google Drive, or Dropbox).


Text Annotation Tool on BasicAI Cloud - Step 3

Text-based Algorithm Application Fields

NLP algorithms trained by text annotation have a wide range of applications, particularly in industries such as new retail, customer service, advertising, finance, and healthcare. The main application types include data cleaning, semantic recognition, entity recognition, scene recognition, emotion recognition, and response recognition.

Search and Recommendation

Search and recommendation are effective ways for users to cope with information overload. Search tasks have expanded from simple text queries to retrieval of multimodal information, including images, audio, and video. Retrieval methods include point-to-point retrieval, exact matching, and semantic matching. Recommendation tasks provide active feedback, using precise recommendation systems to suggest items similar to user preferences in an explainable way. Search provides feedback through user input, intent understanding, and recall ranking, while recommendation designs features from personalized user data and combines them with recall strategies to provide feedback to users in a hot-start manner.


Text Annotation Applications - Search and Recommendation

Sentiment Analysis

Sentiment analysis has broad application scenarios, such as review analysis for consumers and manufacturers, current affairs commentary for predicting international events, and public opinion monitoring using online social media. By analyzing the spread and interaction of individual emotions, attitudes, opinions, and views, a timeline of key information is formed, allowing for quick capture of hotspots and facilitating strict monitoring and guidance of public opinion. Sentiment analysis has broad prospects in stock prediction, crisis early warning, event evolution, topic detection, and topic focus.


Text Annotation Applications - Sentiment Analysis

Intelligent Customer Service

Intelligent customer service can propose responses suited to the scenario based on the user's inquiry and existing corpora, reducing labor costs by resolving the problems it can before handing over to human agents. Building the response system requires classifying the text generated by users' inquiries, pre-labeling users' questions, and feeding them into the corresponding model. Data annotation in this context involves labeling the scene of each sentence and assigning users' questions to the corresponding scenes, which requires familiarity with the industry's business logic in order to build the robot's response knowledge base.


Text Annotation Applications - Intelligent Customer Service

Intelligent Finance

The financial industry is developing towards networking, intelligence, and personalization. In corporate signing, reading key information in commercial contracts, such as company names, contract numbers, invoice numbers, related amounts, due dates, and risk warnings, is particularly important. Establishing an enterprise contract analysis model to extract relevant information from contracts can reduce labor, lower costs, and improve work efficiency.
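Before a trained model exists, simple rules can pre-annotate candidate fields for human review. The sketch below uses regular expressions on a made-up contract sentence; the patterns, label names, and sample text are all illustrative assumptions, and real contracts would need far more robust rules.

```python
import re

# Hypothetical patterns for three of the fields mentioned above.
PATTERNS = {
    "contract_number": re.compile(r"Contract No\.\s*([A-Z0-9-]+)"),
    "amount": re.compile(r"\$\s?([\d,]+(?:\.\d{2})?)"),
    "due_date": re.compile(r"due (?:on|by) (\d{4}-\d{2}-\d{2})"),
}


def pre_annotate(text):
    """Return candidate spans (label, start, end, text) sorted by position in the text."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append({"label": label, "start": m.start(1),
                          "end": m.end(1), "text": m.group(1)})
    return sorted(spans, key=lambda s: s["start"])


sample = "Contract No. AB-2024-001: Acme Ltd. shall pay $12,500.00, due on 2024-09-30."
candidates = pre_annotate(sample)
```

Annotators then confirm or correct each candidate span, which is usually faster than labeling from scratch.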


Text Annotation Applications - Intelligent Finance

Intelligent Healthcare

Hospitals are using robots to provide guidance services, such as appointment registration, department navigation, and patient condition analysis. Intelligent robots can understand doctors' requirements and lead patients to the examination room and treatment department. Medical record text annotation involves labeling information with text boxes and establishing an electronic medical record system through text transcription of medical record content, which can be used as a basic data source for disease research and clinical trials. Annotating and processing natural language in the medical industry requires a high level of professionalism and specialized medical talents.


Text Annotation Applications - Intelligent Healthcare