# UnifiedQA Dataset

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LAION-AI/Open-Assistant/blob/main/notebooks/data-augmentation/unified-qa/unified-qa.ipynb)

The purpose of this notebook is to download datasets from the UnifiedQA dataset collection and convert them into a format that can be used for training OpenAssistant.

The UnifiedQA repo can be found here: https://github.com/allenai/unifiedqa

If you extend or use this work, please cite the relevant papers:
```
@article{khashabi2022unifiedqa,
    title={UnifiedQA-v2: Stronger Generalization via Broader Cross-Format Training},
    author={Khashabi, Daniel and Kordi, Yeganeh and Hajishirzi, Hannaneh},
    journal={arXiv preprint arXiv:2202.12359},
    year={2022}
}
```

## Compare xP3 and UnifiedQA

As many of the datasets that are in UnifiedQA are already in xP3, we do a simple (and incomplete) check to limit the number of datasets that we download.

In [1]:
xp3_list = [
    "Code Miscellaneous",
    "CodeComplex",
    "Docstring Corpus",
    "GreatCode",
    "State Changes",
    "Closed-book QA",
    "Hotpot QA",
    "Trivia QA",
    "Web Questions",
    "Wiki QA",
    "Extractive QA",
    "Adversarial QA",
    "CMRC2018",
    "DRCD",
    "DuoRC",
    "MLQA",
    "Quoref",
    "ReCoRD",
    "ROPES",
    "SQuAD v2",
    "xQuAD",
    "TyDI QA",
    "Primary",
    "Goldp",
    "Multiple-Choice QA",
    "ARC",
    "C3",
    "CoS-E",
    "Cosmos",
    "DREAM",
    "MultiRC",
    "OpenBookQA",
    "PiQA",
    "QUAIL",
    "QuaRel",
    "QuaRTz",
    "QASC",
    "RACE",
    "SciQ",
    "Social IQA",
    "Wiki Hop",
    "WiQA",
    "Paraphrase Identification",
    "MRPC",
    "PAWS",
    "PAWS-X",
    "QQP",
    "Program Synthesis",
    "APPS",
    "CodeContests",
    "JupyterCodePairs",
    "MBPP",
    "NeuralCodeSearch",
    "XLCoST",
    "Structure-to-text",
    "Common Gen",
    "Wiki Bio",
    "Sentiment",
    "Amazon",
    "App Reviews",
    "IMDB",
    "Rotten Tomatoes",
    "Yelp",
    "Simplification",
    "BiSECT",
    "Summarization",
    "CNN Daily Mail",
    "Gigaword",
    "MultiNews",
    "SamSum",
    "Wiki-Lingua",
    "XLSum",
    "XSum",
    "Topic Classification",
    "AG News",
    "DBPedia",
    "TNEWS",
    "TREC",
    "CSL",
    "Translation",
    "Flores-200",
    "Tatoeba",
    "Word Sense disambiguation",
    "WiC",
    "XL-WiC",
    "Evaluation datasets (included in xP3all except for HumanEval)",
    "Natural Language Inference",
    "ANLI",
    "CB",
    "RTE",
    "XNLI",
    "Coreference Resolution",
    "Winogrande",
    "XWinograd",
    "Program Synthesis",
    "HumanEval",
    "Sentence Completion",
    "COPA",
    "Story Cloze",
    "XCOPA",
    "XStoryCloze",
    "Additional xP3all datasets",
    "Coreference Resolution",
    "WSC (Fixed)",
    "Sentence Completion",
    "HellaSwag",
    "Translation",
    "MultiEurlex",
]

In [2]:
unifiedQA_list = [
    "SQuAD 1.1",
    "SQuAD 2",
    "NewsQA",
    "Quoref",
    "ROPES",
    "NarrativeQA",
    "DROP",
    "NaturalQuestions",
    "MCTest",
    "RACE",
    "OpenBookQA",
    "ARC",
    "CommonsenseQA",
    "QASC",
    "PhysicalIQA",
    "SocialIQA",
    "Winogrande",
    "BoolQ",
    "MultiRC (yes/no)",
    "BoolQ-NP",
]

Now that we've defined the list of datasets (which we found in the paper for UnifiedQA and on the Hugging Face page of xP3) we can do the simple check.

In [3]:
for ds in unifiedQA_list:
    if ds not in xp3_list:
        print(ds)

SQuAD 1.1
SQuAD 2
NewsQA
NarrativeQA
DROP
NaturalQuestions
MCTest
CommonsenseQA
PhysicalIQA
SocialIQA
BoolQ
MultiRC (yes/no)
BoolQ-NP


The SQuAD dataset is actually covered (with a slightly different name) but the other ones should be downloaded.
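
The exact string match above misses names that differ only in spacing or capitalization. A minimal sketch of a normalized comparison follows, using short sample lists taken from the names above rather than the full lists. The `normalize` helper is an assumption and not part of the notebook, and renamed datasets (for example `PhysicalIQA` versus `PiQA`) would still need a manual check.

```python
import re


def normalize(name: str) -> str:
    # Lowercase and drop everything except letters and digits.
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Small samples of the two lists, for illustration only.
xp3_sample = ["SQuAD v2", "Social IQA", "PiQA", "Quoref"]
unifiedqa_sample = ["SQuAD 2", "SocialIQA", "PhysicalIQA", "NewsQA"]

xp3_normalized = {normalize(name) for name in xp3_sample}
missing = [name for name in unifiedqa_sample if normalize(name) not in xp3_normalized]
print(missing)
```

With this normalization `SocialIQA` is recognized as already covered, while `SQuAD 2` and `PhysicalIQA` still appear as missing because their xP3 names differ by more than formatting.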

# OpenAssistant Data Scheme

We will use the data scheme that can be found in the docs for Open-Assistant. This code is taken from the StackExchange notebook.

In [4]:
from typing import TypeVar, List, Dict, Any, Literal
from json import JSONEncoder

T = TypeVar("T", bound="ConversationTreeNode")


class ConversationTreeNode:
    text: str  # The text of the node
    role: Literal["prompter", "assistant"]  # Whether the node is a user prompt/follow-up or an assistant response
    children: List[T]  # The children of the node (if you have a linear conversation, this will be of length 0 or 1)
    metadata: Dict[str, Any]  # Node metadata (see below)

    def __init__(
        self, text: str, role: Literal["prompter", "assistant"], children: List[T], metadata: Dict[str, Any]
    ) -> None:
        self.text = text
        self.role = role
        self.children = children
        self.metadata = metadata


class ConversationTree:
    root: ConversationTreeNode  # The node containing the initial prompt
    metadata: Dict[str, Any]  # Tree metadata, different from root node metadata.

    def __init__(self, root: ConversationTreeNode, metadata: Dict[str, Any]) -> None:
        self.root = root
        self.metadata = metadata


# subclass JSONEncoder
class TreeEncoder(JSONEncoder):
    def default(self, o):
        return o.__dict__
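
To make the scheme concrete, here is a minimal, self-contained sketch of how a linear conversation could be chained into a tree and serialized. The `Node` and `Tree` classes are hypothetical stand-ins with the same fields as the classes above, and the conversation text is made up.

```python
import json
from json import JSONEncoder


class Node:
    """Hypothetical stand-in with the same fields as ConversationTreeNode."""

    def __init__(self, text, role, children, metadata):
        self.text = text
        self.role = role
        self.children = children
        self.metadata = metadata


class Tree:
    """Hypothetical stand-in with the same fields as ConversationTree."""

    def __init__(self, root, metadata):
        self.root = root
        self.metadata = metadata


class TreeEncoder(JSONEncoder):
    def default(self, o):
        # Fall back to the attribute dictionary for objects json cannot encode.
        return o.__dict__


turns = ["What is 2 + 2?", "4", "And 3 + 3?", "6"]  # made-up conversation

# Chain the turns: even positions are prompter turns, odd ones assistant replies.
root = Node(turns[0], "prompter", [], None)
prev = root
for i, text in enumerate(turns[1:], start=1):
    node = Node(text, "assistant" if i % 2 == 1 else "prompter", [], None)
    prev.children.append(node)
    prev = node

tree = Tree(root, {"dataset": "example"})
encoded = json.loads(TreeEncoder().encode(tree))
print(json.dumps(encoded, indent=2))
```

Because `default` returns `__dict__`, the encoder recurses through `root` and every `children` list without any per-class serialization code.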

# Manually Get URLs

We now define the list of URLs that we want to download. These URLs were found by manually browsing UnifiedQA's Google Cloud bucket: https://console.cloud.google.com/storage/browser/unifiedqa/data

In [23]:
urls = [
    "https://storage.googleapis.com/unifiedqa/data/natural_questions/train.tsv",
    "https://storage.googleapis.com/unifiedqa/data/narrativeqa/train.tsv",
    "https://storage.googleapis.com/unifiedqa/data/newsqa/train.tsv",
    "https://storage.googleapis.com/unifiedqa/data/drop/train.tsv",
    "https://storage.googleapis.com/unifiedqa/data/commonsenseqa/train.tsv",
    "https://storage.googleapis.com/unifiedqa/data/physical_iqa/train.tsv",
    "https://storage.googleapis.com/unifiedqa/data/social_iqa/train.tsv",
    "https://storage.googleapis.com/unifiedqa/data/boolq/train.tsv",
    "https://storage.googleapis.com/unifiedqa/data/boolq_np/train.tsv",
]

In [24]:
dataset_names = [url[len("https://storage.googleapis.com/unifiedqa/data/") :].split("/")[0] for url in urls]
dataset_names

['natural_questions',
 'narrativeqa',
 'newsqa',
 'drop',
 'commonsenseqa',
 'physical_iqa',
 'social_iqa',
 'boolq',
 'boolq_np']
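
As an alternative to slicing by prefix length, the dataset name can be read from the URL path. This is only a sketch: the helper name `dataset_name_from_url` is hypothetical, and it assumes every URL ends in `<dataset>/<file>.tsv` as in the list above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def dataset_name_from_url(url: str) -> str:
    # The dataset name is the directory that contains the TSV file.
    return PurePosixPath(urlparse(url).path).parent.name


example_url = "https://storage.googleapis.com/unifiedqa/data/natural_questions/train.tsv"
print(dataset_name_from_url(example_url))
```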

## Convert each dataset to a Prompt-Response pair

We'll now create a dictionary of lists: each dataset name maps to a list of template functions, and each template turns a question-answer pair into a list of alternating prompter and assistant turns.

In [236]:
converter_functions = {}
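
To illustrate the layout, the following sketch fills a separate dictionary with two toy templates and applies a randomly chosen one. The names `example_functions`, `plain`, `with_intro`, and `toy_dataset` are hypothetical and exist only for this example.

```python
import random


def plain(q, a):
    # Simplest template: one prompt, one reply.
    return [q, a]


def with_intro(q, a):
    # Template that adds an opening exchange before the actual question.
    return ["Can I ask you a question?", "Sure, go ahead.", q, a]


example_functions = {"toy_dataset": [plain, with_intro]}

rng = random.Random(0)  # local generator so the global seed is untouched
template = rng.choice(example_functions["toy_dataset"])
conversation = template("What color is the sky?", "blue")
print(conversation)
```

Every template returns an even number of strings, so the turns can be assigned to prompter and assistant by position.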

## 1. Natural Questions

The dataset has short answers, but the questions are framed as natural questions, as the dataset name implies.

In [237]:
converter_functions["natural_questions"] = [lambda a, b: [a, b]]

## 2. Narrative QA

In [238]:
def nar_qa_1(q, a):
    return [q, a]


def nar_qa_2(q, a):
    conv = []
    conv.append("I am going to be asking you some questions on the following text:" + q.split("\\n")[1])
    conv.append("Okay, what question do you have about the text?")
    conv.append(q.split("\\n")[0])
    conv.append(a)
    return conv


def nar_qa_3(q, a):
    conv = []
    conv.append("I am going to be asking you some questions about the following text")
    conv.append(
        "Sure, I can help you with understanding and analyzing a text. What is the text that you would like me to work on?"
    )
    conv.append(q.split("\\n")[1])
    conv.append("Okay, what question do you have about the text?")
    conv.append(q.split("\\n")[0])
    conv.append(a)
    return conv


def nar_qa_4(q, a):
    conv = []
    conv.append("I have a text that I need help with")
    conv.append(
        "I can help you with understanding and analyzing a text. What is the text that you would like me to work on?"
    )
    conv.append(q.split("\\n")[1])
    conv.append("Okay, what question do you have about the text?")
    conv.append(q.split("\\n")[0])
    conv.append(a)
    return conv


def nar_qa_5(q, a):
    conv = []
    conv.append("Can you help me answer questions about a text?")
    conv.append(
        "Yes, as I can help you with understanding and analyzing a text. What is the text that you would like me to work on?"
    )
    conv.append(q.split("\\n")[1])
    conv.append("Okay, what question do you have about the text?")
    conv.append(q.split("\\n")[0])
    conv.append(a)
    return conv


def nar_qa_6(q, a):
    conv = []
    conv.append("Based on the text that I will give you, please answer the following question: " + q.split("\\n")[0])
    conv.append(
        "Okay sure, as I can help you with answering the question '"
        + q.split("\\n")[0]
        + "'. What text should I use to answer this question?"
    )
    conv.append(q.split("\\n")[1])
    conv.append(a)
    return conv


templates_nar_qa = [nar_qa_1, nar_qa_2, nar_qa_3, nar_qa_4, nar_qa_5, nar_qa_6]
converter_functions["narrativeqa"] = templates_nar_qa
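
The templates above split on `"\\n"`, which is the literal two-character sequence backslash-n rather than a newline character, with the question first and the passage second. A minimal sketch of that split on a made-up row (the helper `split_question_context` is hypothetical and also strips surrounding whitespace, which the templates do not do):

```python
def split_question_context(q: str):
    # "\\n" is a two-character string: a backslash followed by the letter n.
    question, _, context = q.partition("\\n")
    return question.strip(), context.strip()


sample = "Who wrote the letter? \\n The letter was written by Anna."  # made-up row
question, context = split_question_context(sample)
print(question)
print(context)
```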

## 3. News QA

In [239]:
def news_qa_1(q, a):
    return [q, a]


def news_qa_2(q, a):
    conv = []
    question, context = q.split("\\n")
    try:
        context = context.split("-- ")[1]
    except:
        context = context
    conv.append("I am going to be asking you some questions on the following text:" + context)
    conv.append("Okay, what question do you have about the text?")
    conv.append(question)
    conv.append(a)
    return conv


def news_qa_3(q, a):
    conv = []
    question, context = q.split("\\n")
    try:
        context = context.split("-- ")[1]
    except:
        context = context
    conv.append("I am going to be asking you some questions about the following text")
    conv.append(
        "Sure, I can help you with understanding and analyzing a text. What is the text that you would like me to work on?"
    )
    conv.append(context)
    conv.append("Okay, what question do you have about the text?")
    conv.append(question)
    conv.append(a)
    return conv


def news_qa_4(q, a):
    conv = []
    question, context = q.split("\\n")
    try:
        context = context.split("-- ")[1]
    except:
        context = context
    conv.append("I have a text that I need help with")
    conv.append(
        "I can help you with understanding and analyzing a text. What is the text that you would like me to work on?"
    )
    conv.append(context)
    conv.append("Okay, what question do you have about the text?")
    conv.append(question)
    conv.append(a)
    return conv


def news_qa_5(q, a):
    conv = []
    question, context = q.split("\\n")
    try:
        context = context.split("-- ")[1]
    except:
        context = context
    conv.append("Can you help me answer questions about a text?")
    conv.append(
        "Yes, as I can help you with understanding and analyzing a text. What is the text that you would like me to work on?"
    )
    conv.append(context)
    conv.append("Okay, what question do you have about the text?")
    conv.append(question)
    conv.append(a)
    return conv


def news_qa_6(q, a):
    conv = []
    question, context = q.split("\\n")
    try:
        context = context.split("-- ")[1]
    except:
        context = context
    conv.append("Based on the text that I will give you, please answer the following question: " + question)
    conv.append(
        "Okay sure, as I can help you with answering the question '"
        + question
        + "'. What text should I use to answer this question?"
    )
    conv.append(context)
    conv.append(a)
    return conv


templates_news_qa = [news_qa_1, news_qa_2, news_qa_3, news_qa_4, news_qa_5, news_qa_6]
converter_functions["newsqa"] = templates_news_qa
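
The `try`/`except` blocks above drop a leading dateline such as `(CNN) -- ` from the context. Note that `split("-- ")[1]` keeps only the text up to the next `-- ` when the passage contains more than one. A minimal sketch of an alternative that keeps everything after the first separator follows; the helper `strip_dateline` is hypothetical and not part of the notebook, and the sample strings are made up.

```python
def strip_dateline(context: str) -> str:
    # partition splits at the first "-- " only; sep is empty if none is found.
    head, sep, tail = context.partition("-- ")
    return tail if sep else context


print(strip_dateline("(CNN) -- Prices rose -- again -- this week."))
print(strip_dateline("A passage without a dateline."))
```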

## 4. Drop

In [240]:
def drop_qa_1(q, a):
    return [q, a]


def drop_qa_2(q, a):
    conv = []
    conv.append("I am going to be asking you some questions on the following text:" + q.split("\\n")[1])
    conv.append("Okay, what question do you have about the text?")
    conv.append(q.split("\\n")[0])
    conv.append(a)
    return conv


def drop_qa_3(q, a):
    conv = []
    conv.append("I am going to be asking you some questions about the following text")
    conv.append(
        "Sure, I can help you with understanding and analyzing a text. What is the text that you would like me to work on?"
    )
    conv.append(q.split("\\n")[1])
    conv.append("Okay, what question do you have about the text?")
    conv.append(q.split("\\n")[0])
    conv.append(a)
    return conv


def drop_qa_4(q, a):
    conv = []
    conv.append("I have a text that I need help with")
    conv.append(
        "I can help you with understanding and analyzing a text. What is the text that you would like me to work on?"
    )
    conv.append(q.split("\\n")[1])
    conv.append("Okay, what question do you have about the text?")
    conv.append(q.split("\\n")[0])
    conv.append(a)
    return conv


def drop_qa_5(q, a):
    conv = []
    conv.append("Can you help me answer questions about a text?")
    conv.append(
        "Yes, as I can help you with understanding and analyzing a text. What is the text that you would like me to work on?"
    )
    conv.append(q.split("\\n")[1])
    conv.append("Okay, what question do you have about the text?")
    conv.append(q.split("\\n")[0])
    conv.append(a)
    return conv


def drop_qa_6(q, a):
    conv = []
    conv.append("Based on the text that I will give you, please answer the following question: " + q.split("\\n")[0])
    conv.append(
        "Okay sure, as I can help you with answering the question '"
        + q.split("\\n")[0]
        + "'. What text should I use to answer this question?"
    )
    conv.append(q.split("\\n")[1])
    conv.append(a)
    return conv


templates_drop_qa = [drop_qa_1, drop_qa_2, drop_qa_3, drop_qa_4, drop_qa_5, drop_qa_6]
converter_functions["drop"] = templates_drop_qa

## 5. CommonsenseQA

In [241]:
def cs_qa_1(q, a):
    return [q, a]


def cs_qa_2(q, a):
    conv = []
    conv.append("I have a multiple choice question that I need help with")
    conv.append("Okay, I can help you with multiple choice questions. Please provide the question.")
    conv.append(q)
    conv.append("The answer is: " + a)
    return conv


def cs_qa_3(q, a):
    conv = []
    conv.append("I have some common sense questions for you to answer.")
    conv.append("Okay, I can try to answer your questions while using common sense. Please provide the question.")
    conv.append(q)
    conv.append("The common sense answer would be: " + a)
    return conv


templates_cs_qa = [cs_qa_1, cs_qa_2, cs_qa_3]
converter_functions["commonsenseqa"] = templates_cs_qa

## 6. Physical IQA

In [242]:
def ph_qa_1(q, a):
    return [q, a]


def ph_qa_2(q, a):
    conv = []
    conv.append("I have a multiple choice question that I need help with")
    conv.append("Okay, I can help you with multiple choice questions. Please provide the question.")
    conv.append(q)
    conv.append("The answer is: " + a)
    return conv


def ph_qa_3(q, a):
    conv = []
    conv.append("Can I ask you a question?")
    conv.append("Sure, you can ask me a question! I'll try my best to answer it.")
    conv.append(q)
    conv.append("I think the answer is: " + a)
    return conv


def ph_qa_4(q, a):
    return [q.split("\\n")[0], a]


templates_ph_qa = [ph_qa_1, ph_qa_2, ph_qa_3, ph_qa_4]
converter_functions["physical_iqa"] = templates_ph_qa

## 7. Social IQA

In [243]:
def so_qa_1(q, a):
    return [q, a]


def so_qa_2(q, a):
    conv = []
    conv.append("I have a multiple choice question that I need help with")
    conv.append("Okay, I can help you with multiple choice questions. Please provide the question.")
    conv.append(q)
    conv.append("The answer is: " + a)
    return conv


def so_qa_3(q, a):
    conv = []
    conv.append("Can I ask you a question?")
    conv.append("Sure, you can ask me a question! I'll try my best to answer it.")
    conv.append(q)
    conv.append("I think the answer is: " + a)
    return conv


def so_qa_4(q, a):
    conv = []
    ques, options, context = q.split("\\n")
    conv.append("I have a question about this text:" + context)
    conv.append("Okay, what question do you have?")
    conv.append(ques)
    conv.append(a)
    return conv


def so_qa_5(q, a):
    conv = []
    ques, options, context = q.split("\\n")
    conv.append("I have a question about this text:" + context)
    conv.append("Okay, what question do you have?")
    conv.append(ques + "\\n" + options)
    conv.append(a)
    return conv


def so_qa_6(q, a):
    conv = []
    ques, options, context = q.split("\\n")
    conv.append("Based on the text that I will provide, please answer the following question:" + ques)
    conv.append("Okay, what text can I use to derive the answer?")
    conv.append(context)
    conv.append(a)
    return conv


templates_so_qa = [so_qa_1, so_qa_2, so_qa_3, so_qa_4, so_qa_5, so_qa_6]
converter_functions["social_iqa"] = templates_so_qa
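
Templates 4 to 6 above expect exactly three parts separated by the literal `\n` sequence: question, options, and context. If a row has a different number of parts, the tuple unpacking raises `ValueError`. The sketch below shows a defensive variant on a made-up row; the helper `unpack_three` and the `(A) ... (B) ...` option layout are assumptions for illustration.

```python
def unpack_three(q: str):
    # Split on the literal backslash-n separator and require exactly three parts.
    parts = q.split("\\n")
    if len(parts) != 3:
        return None
    return tuple(part.strip() for part in parts)


sample = "How would Alex feel? \\n (A) happy (B) sad (C) angry \\n Alex won the game."
print(unpack_three(sample))
print(unpack_three("Only a question"))
```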

## 8. BoolQ

In [244]:
def bq_qa_1(q, a):
    return [q, a]


def bq_qa_2(q, a):
    ques, context = q.split("\\n")
    conv = []
    conv.append(ques)
    conv.append(a.capitalize() + ". " + context)
    return conv


def bq_qa_3(q, a):
    ques, context = q.split("\\n")
    conv = []
    conv.append("Based on the following text, please answer my questions: " + context)
    conv.append("Sure, what question do you have?")
    conv.append(ques)
    conv.append("Based on the text above, the answer is: " + a)
    return conv


templates_bq_qa = [bq_qa_1, bq_qa_2, bq_qa_3]
converter_functions["boolq"] = templates_bq_qa

## 9. BoolQ NP

In [245]:
converter_functions["boolq_np"] = templates_bq_qa

## Helper Functions

In [252]:
## Quality assurance function
def is_valid_conversation(my_conv, q, a, verbose=False):
    if len(my_conv) % 2 != 0:
        if verbose:
            print("Uneven number of entries in")
            print(q[:1000])
            print(a)
        return False
    if not all(isinstance(item, str) for item in my_conv):
        if verbose:
            print("Non-str entries in")
            print(q[:1000])
            print(a)
        return False
    return True
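
A short usage sketch of the two checks follows. `looks_valid` is a hypothetical condensed version of `is_valid_conversation` without the verbose printing, so that the example runs on its own.

```python
def looks_valid(conv):
    # Same two checks as above: even number of turns, and every turn is a string.
    return len(conv) % 2 == 0 and all(isinstance(item, str) for item in conv)


print(looks_valid(["question", "answer"]))               # even length, all strings
print(looks_valid(["question", "answer", "follow-up"]))  # odd number of turns
print(looks_valid(["question", float("nan")]))           # non-string entry
```

The last case matters in practice because a missing cell in the source table is a float, not a string.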

In [253]:
def print_conv(root):
    if root.text is not None:
        print(root.text[:100])
    if len(root.children) > 0:
        print_conv(root.children[0])
    return ""

# Download and Save as Raw Inputs

We first import pandas, which we'll use to download the TSV files from Google Cloud Storage, along with the other libraries that we'll need.

In [254]:
import pandas as pd
import json
import random
import numpy as np

In [259]:
random.seed(20)  # for reproducibility

The following is a simple function that takes the data (which has two columns) and converts each row to a conversation tree: a root node (the first prompter turn) followed by a linear chain of child nodes that ends with the answer.

In [None]:
def convert_unified_qa(dataset_url):
    # download using pandas
    ds = pd.read_csv(dataset_url, on_bad_lines="skip", names=["Question", "Answer"], sep="\t")
    # get name for metadata
    ds_name = dataset_url.split("/unifiedqa/data/")[1].split("/")[0]
    # get conversation templates list
    conv_funcs = converter_functions[ds_name]

    # create conversation forest
    conversation_forest = []
    for item in ds.itertuples():
        # get q,a from table
        question = item.Question
        answer = item.Answer
        # NaN never equals itself, so use pd.isna to detect missing cells
        if pd.isna(question) or pd.isna(answer):
            print("!!!!!!!!!!!! Skipped one example")
            continue
        # get a random conversation generator function
        conv_func = random.choice(conv_funcs)
        try:
            conv_list = conv_func(question, answer)
        except Exception:
            print("!!!!!!!!!!!! Skipped one example")
            #             print(conv_func)
            #             print(question)
            #             print(answer)
            continue
        if not is_valid_conversation(conv_list, item.Question, item.Answer):
            print("!!!!!!!!!!!! Skipped one example")
            continue
        # build nodes and tree
        root = ConversationTreeNode(text=conv_list[0], role="prompter", children=[], metadata=None)
        prev_node = root
        for i in range(1, len(conv_list)):
            # odd positions are assistant replies, even positions are prompter turns
            role = "assistant" if i % 2 == 1 else "prompter"
            next_node = ConversationTreeNode(text=conv_list[i], role=role, children=[], metadata=None)
            prev_node.children.append(next_node)
            prev_node = next_node
        conversation_tree = ConversationTree(root=root, metadata={"dataset": ds_name})

        # save the tree to the forest
        conversation_forest.append(conversation_tree)

    conversation_forest_json = [
        json.loads(TreeEncoder().encode(conversation_tree)) for conversation_tree in conversation_forest
    ]

    # write the forest to disk and close the file afterwards
    with open(f"./{ds_name}.json", "w+") as f:
        print(json.dumps(conversation_forest_json, indent=4), file=f)

    print("Finished converting dataset")
    print(" ")
    print("*****", ds_name, "****")
    # print(ds.head(2))
    print(print_conv(conversation_forest[0].root))
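
Empty cells in a table read by pandas show up as NaN, and NaN never compares equal to anything, including itself, so missing values need an explicit test rather than `==`. The sketch below shows this with the standard library only; the helper `is_missing` is hypothetical, and in the notebook itself `pd.isna` serves the same purpose.

```python
import math

nan = float("nan")
print(nan == nan)  # NaN is not equal to itself


def is_missing(value) -> bool:
    # Only floats can be NaN; strings and other types are never "missing" here.
    return isinstance(value, float) and math.isnan(value)


print(is_missing(nan))
print(is_missing("some question text"))
```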

In [261]:
for url in urls:
    convert_unified_qa(url)

!!!!!!!!!!!! Skipped one example
Finished converting dataset
 
***** natural_questions ****
which is the most common use of opt-in e-mail marketing?
a newsletter sent to an advertising firm's customers

Finished converting dataset
 
***** narrativeqa ****
I am going to be asking you some questions about the following text
Sure, I can help you with understanding and analyzing a text. What is the text that you would like m
  At Madeline Hall, an old mansion-house near Southampton belonging to the wealthy de Versely family
Okay, what question do you have about the text?
Who is Miss Delmer? 
 the elderly spinster aunt of the Earl de Verseley and Captain Delmar 

!!!!!!!!!!!! Skipped one example
Finished converting dataset
 
***** newsqa ****
How many Americans are part of the federal food assistance program? \n (CNN) -- As Walter Thomas kno
31 million

Finished converting dataset
 
***** drop ****
I am going to be asking you some questions on the following text: To start the season, the Li