Huggingface Core Modules (2): datasets

Huggingface's datasets is an important library: it lets us load datasets quickly and also process them, including splitting, caching, and downloading. This article covers datasets in detail: basic usage, the common methods, and how to build a custom dataset.

1. Basic usage of datasets

1.1 Loading datasets from the Huggingface Hub

All datasets hosted on the Huggingface Hub can be downloaded and used directly. The whole process goes through load_dataset (see the official documentation), whose key parameters are:

  • path: the name of the dataset, e.g. imdb or glue; it can also be a generic dataset-loading script such as json, csv, parquet, text, or a custom .py file
  • name: the sub-dataset (configuration) to load, needed when one dataset contains multiple subsets; e.g. glue contains sst2, cola, qqp, and others
  • data_files: the path(s) to local data files when the dataset has already been downloaded
  • split: which split to load: train, test, or validation; when specified, a single Dataset object is returned; when left as None (the default), a DatasetDict containing all splits is returned
  • cache_dir: the cache directory for datasets, ~/.cache/huggingface/datasets by default
  • revision: the version of the dataset to load

Load the imdb dataset from the Huggingface Hub:

>>> from datasets import load_dataset
>>> dataset = load_dataset("imdb")
Downloading metadata: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2.17k/2.17k [00:00<00:00, 6.82MB/s]
Downloading readme: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 7.59k/7.59k [00:00<00:00, 11.0MB/s]
Downloading and preparing dataset imdb/plain_text to /Users/harry/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0...
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 84.1M/84.1M [00:48<00:00, 1.73MB/s]
Dataset imdb downloaded and prepared to /Users/harry/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0. Subsequent calls will reuse this data.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 152.18it/s]
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
>>>

A single line of code downloads the imdb dataset from the Huggingface Hub. Printing the downloaded dataset shows a DatasetDict object containing the three splits train, test, and unsupervised. The following sections introduce the dataset's parameters and usage.
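For instance, passing split returns a single Dataset instead of a DatasetDict, and name selects a sub-dataset (a minimal sketch of the two calls; outputs omitted):

>>> from datasets import load_dataset
>>> # Load only the training split as a single Dataset object
>>> train_dataset = load_dataset("imdb", split="train")
>>> # Load the sst2 subset of glue via the name parameter
>>> sst2 = load_dataset("glue", name="sst2")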

1.2 Loading a local dataset

Here we use the Italian question-answering dataset squad_it as an example to show how to download and process a local dataset.

1. Manually download and decompress the squad_it dataset

First we download squad_it manually to take a look at its format:

wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

# decompress
gzip -dkv SQuAD_it-*.json.gz
SQuAD_it-train.json.gz: 82.2% -- replaced with SQuAD_it-train.json
SQuAD_it-test.json.gz: 87.4% -- replaced with SQuAD_it-test.json

After downloading, let's look at the dataset's content:

{
    "data": [
        {
            "title": "Terremoto del Sichuan del 2008",
            "paragraphs": [
                {
                    "context": "Il terremoto del Sichuan del 2008 o il terremoto del Gran Sichuan, misurato a 8.0 Ms e 7.9 Mw, e si è verificato alle 02:28:01 PM China Standard Time all' epicentro (06:28:01 UTC) il 12 maggio nella provincia del Sichuan, ha ucciso 69.197 persone e lasciato 18.222 dispersi.",
                    "qas": [
                        {
                            "id": "56cdca7862d2951400fa6826",
                            "answers": [
                                {
                                    "text": "2008",
                                    "answer_start": 29
                                }
                            ],
                            "question": "In quale anno si è verificato il terremoto nel Sichuan?"
                        },
                        ...

The nested structure is:

|-data
  |-title
  |-paragraphs
    |-context
    |-qas
      |-id
      |-answers
        |-text
        |-answer_start
      |-question
      |-id
      |-answers
        |-text
        |-answer_start
      |-question

Next we use the load_dataset function to load this dataset.

2. Loading with load_dataset

Load the data directly with load_dataset, using SQuAD_it-train.json as an example:

>>> squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")
Downloading and preparing dataset json/default to /Users/harry/.cache/huggingface/datasets/json/default-e0b956320ae13300/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96...
Downloading data files: 100%|███████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 8774.69it/s]
Extracting data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 463.15it/s]
Dataset json downloaded and prepared to /Users/harry/.cache/huggingface/datasets/json/default-e0b956320ae13300/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96. Subsequent calls will reuse this data.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 355.36it/s]
>>> print(squad_it_dataset)
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})

In the load_dataset call, "json" means the generic json loading script is used, data_files is the path to the data file, and field selects which top-level field holds the records; here the records live under data, so we set it to data.

The returned dataset is a DatasetDict. When we don't specify otherwise, the loaded data is assigned to the train split by default, and we can index into it directly:

>>> squad_it_dataset["train"][0]
{
    "title": "Terremoto del Sichuan del 2008",
    "paragraphs": [
        {
            "context": "Il terremoto del Sichuan del 2008 o il terremoto...",
            "qas": [
                {
                    "answers": [{"answer_start": 29, "text": "2008"}],
                    "id": "56cdca7862d2951400fa6826",
                    "question": "In quale anno si è verificato il terremoto nel Sichuan?",
                },
                ...

As shown above, this prints the first title together with the contents of its paragraphs, because it is the first entry in the data list.

3. Loading the train and test sets more flexibly

The loaded file is assigned to train by default. If we want to set train and test separately, we can pass a dict to load_dataset's data_files parameter:

>>> data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
>>> squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
Downloading and preparing dataset json/default to /Users/harry/.cache/huggingface/datasets/json/default-deaf5fe77027f091/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96...
Downloading data files: 100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 12846.26it/s]
Extracting data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 643.50it/s]
Dataset json downloaded and prepared to /Users/harry/.cache/huggingface/datasets/json/default-deaf5fe77027f091/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96. Subsequent calls will reuse this data.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 462.97it/s]
>>> squad_it_dataset
DatasetDict({
    train: Dataset({
        features: ['paragraphs', 'title'],
        num_rows: 442
    })
    test: Dataset({
        features: ['paragraphs', 'title'],
        num_rows: 48
    })
})

In the example above we downloaded the dataset in advance and then assigned train and test through the data_files parameter. Moreover, if the data files are compressed or stored remotely, load_dataset can still decompress or download them automatically:

# load_dataset decompresses automatically
data_files = {"train": "SQuAD_it-train.json.gz", "test": "SQuAD_it-test.json.gz"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

# load_dataset downloads automatically
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
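The same generic-script pattern applies to the other formats mentioned earlier; for example, a sketch for csv files (the file names here are hypothetical):

# "my_train.csv" and "my_test.csv" are hypothetical local files
data_files = {"train": "my_train.csv", "test": "my_test.csv"}
csv_dataset = load_dataset("csv", data_files=data_files)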

2. Operating on datasets

All of the datasets methods below can be found in the official documentation.

2.1 The filter method

The filter method removes specific examples from a dataset. For example, in the glue/mrpc dataset, suppose we want to keep only the examples whose sentence1 starts with the character " and drop the rest; this can be done as follows:

>>> raw_datasets = load_dataset("glue", "mrpc")
Found cached dataset glue (/Users/harry/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 375.89it/s]
>>> sentence_sample = raw_datasets['train'].shuffle(seed=42).select(range(100))
>>> sentence_sample[:3]
{'sentence1': ['" The public is understandably losing patience with these unwanted phone calls , unwanted intrusions , " he said at a White House ceremony .', 'Federal agent Bill Polychronopoulos said it was not known if the man , 30 , would be charged .', 'The companies uniformly declined to give specific numbers on customer turnover , saying they will release those figures only when they report overall company performance at year-end .'], 'sentence2': ['" While many good people work in the telemarketing industry , the public is understandably losing patience with these unwanted phone calls , unwanted intrusions , " Mr. Bush said .', 'Federal Agent Bill Polychronopoulos said last night the man involved in the Melbourne incident had been unarmed .', 'The companies , however , declined to give specifics on customer turnover , saying they would release figures only when they report their overall company performance .'], 'label': [0, 0, 1], 'idx': [3946, 3683, 3919]}
>>> sentence_sample = raw_datasets['train']
>>> print(sentence_sample)
Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})
>>> sentence_sample = sentence_sample.filter(lambda x: x["sentence1"][0]=="\"")
>>> print(sentence_sample)
Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 343
})
>>> sentence_sample[:3]
{'sentence1': ['" I think you \'ll see a lot of job growth in the next two years , " he said , adding the growth could replace jobs lost .', '" The result is an overall package that will provide significant economic growth for our employees over the next four years . "', '" We are declaring war on sexual harassment and sexual assault .'], 'sentence2': ['" I think you \'ll see a lot of job growth in the next two years , " said Mankiw .', '" The result is an overall package that will provide a significant economic growth for our employees over the next few years , " he said .', '" We have declared war on sexual assault and sexual harassment , " Rosa said .'], 'label': [0, 1, 1], 'idx': [20, 49, 89]}
>>>

The anonymous lambda passed to filter above can also be replaced with an ordinary named function, for example:

def filter_quote(x):
    return x["sentence1"][0] == "\""
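which can then be passed to filter directly:

>>> sentence_sample = raw_datasets['train'].filter(filter_quote)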

2.2 The map method

The map method lets us apply a function to every example in the dataset. Its main parameters are:

  • remove_columns: columns to drop, e.g. remove_columns=["sentence1"] removes the sentence1 column
  • batched: whether to process in batches; if True, the function receives a batch (a dict of lists) rather than a single example, which usually makes map much faster (see the sketch after example 2 below)
  • num_proc: the number of processes to use for multiprocessing

1. Upper-case all sentences in sentence1 of the glue/mrpc dataset

>>> sentence_sample = raw_datasets['train']
>>> def upper_sentence(example):
...     example["sentence1"] = example["sentence1"].upper()
...     return example
...
>>> sentence_sample = sentence_sample.map(upper_sentence)
>>> sentence_sample[0]
{'sentence1': 'AMROZI ACCUSED HIS BROTHER , WHOM HE CALLED " THE WITNESS " , OF DELIBERATELY DISTORTING HIS EVIDENCE .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0}
>>> raw_datasets['train']['sentence1'][0]
'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .'
>>> sentence_sample['sentence1'][0]
'AMROZI ACCUSED HIS BROTHER , WHOM HE CALLED " THE WITNESS " , OF DELIBERATELY DISTORTING HIS EVIDENCE .'

2. Add a column to the glue/mrpc dataset with the length of each sentence in sentence1

Note that the return value must be a dict; the returned dict automatically adds a new column to the dataset. Likewise, to modify an existing column, follow example 1.

>>> sentence_sample = raw_datasets['train']
>>> def compute_length(example):
...     sentence_length = {}
...     sentence_length["sentence1_len"] = len(example["sentence1"])
...     return sentence_length
...
>>> sentence_sample = sentence_sample.map(compute_length)
>>> sentence_sample
Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx', 'sentence1_len'],
    num_rows: 3668
})
>>> sentence_sample[0]
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0, 'sentence1_len': 103}
>>>
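As a minimal sketch of the batched mode mentioned in the parameter list above (our own example, not from the original), the same length computation can receive a batch as a dict of lists and return a dict of lists, which is usually much faster:

>>> def compute_length_batched(batch):
...     return {"sentence1_len": [len(s) for s in batch["sentence1"]]}
...
>>> sentence_sample = raw_datasets['train'].map(compute_length_batched, batched=True)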

2.3 The rename_column method

The rename_column method renames a column. For example, to rename sentence1 and sentence2 in the glue/mrpc dataset to sen1 and sen2:

>>> sentence_sample = raw_datasets['train']
>>> sentence_sample = sentence_sample.rename_column(original_column_name="sentence1", new_column_name="sen1")
>>> sentence_sample = sentence_sample.rename_column(original_column_name="sentence2", new_column_name="sen2")
>>> print(sentence_sample)
Dataset({
    features: ['sen1', 'sen2', 'label', 'idx'],
    num_rows: 3668
})

2.4 Other methods

See the reference documentation.

1. sort

Sort the dataset:

sortData = dataset.sort('label')

2. shuffle

Shuffle the dataset:

shuffleData = sortData.shuffle(seed=20)

3. select

Select the examples at the given indices:

dataset.select([0,1,2,3])

4. filter

Filter the dataset:

# renamed to avoid shadowing Python's built-in filter
def starts_with_one(data):
    return data['text'].startswith('1')

b = dataset.filter(starts_with_one)

5. train_test_split

Split the dataset into a train set and a test set, e.g. hold out 10% as the test set:

dataset.train_test_split(test_size=0.1)

6. shard

Split the dataset into several shards, e.g. split it into 5 shards and take the first one:

dataset.shard(num_shards=5, index=0)

7. rename_column

Rename a column. This is faster than renaming via map because no new data needs to be copied:

c = a.rename_column('text', 'newColumn')

8. remove_columns

Remove columns:

d = c.remove_columns(['newColumn'])

9. map

Apply a function to each example in the dataset:

def handler(data):
    data['text'] = 'Prefix' + data['text']
    return data

datasetMap = dataset.map(handler)

10. save_to_disk/load_from_disk

Save and load a dataset:

dataset.save_to_disk('./')

from datasets import load_from_disk
dataset = load_from_disk('./')
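A minimal sketch of our own chaining several of the methods above on the imdb dataset (the commented row counts are what we would expect, not verified output):

from datasets import load_dataset

dataset = load_dataset("imdb", split="train")
small = dataset.shuffle(seed=42).select(range(1000))        # sample 1000 rows
splits = small.train_test_split(test_size=0.1)              # 900 train / 100 test rows
first_shard = splits["train"].shard(num_shards=5, index=0)  # first of 5 shards, 180 rows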

3. Custom data loading scripts

Above we called datasets' load_dataset method directly to load data from the Huggingface Hub or from local files. But what if the dataset is complex and we want a custom loading script? Huggingface provides a complete and flexible framework for dataset loading scripts; we can follow the guide here to write our own, and the official docs also provide a template loading script.

3.1 Implementing the two core classes

A custom loading script fills in two core classes, datasets.BuilderConfig and datasets.GeneratorBasedBuilder. The first maintains the dataset's metadata, while the second implements downloading and processing. The overall skeleton looks like this:

from datasets import BuilderConfig, GeneratorBasedBuilder

class MyBuilderConfig(BuilderConfig):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

class MyDatasetBuilder(GeneratorBasedBuilder):
    BUILDER_CONFIGS = []  # list of BuilderConfig instances

    def _info(self):
        pass

    def _split_generators(self, dl_manager):
        pass

    def _generate_examples(self, filepath):
        pass

3.2 Building the BuilderConfig

BuilderConfig carries the following standard attributes:

  • name: the configuration name; used when a dataset has multiple subsets, e.g. glue has mrpc, cola, and other subsets, and name distinguishes between them
  • version: the dataset version
  • description: a description of the dataset
BuilderConfig(name="first_domain", version=VERSION, description="This part of my dataset covers a first domain")
BuilderConfig(name="second_domain", version=VERSION, description="This part of my dataset covers a second domain")

We can also subclass it to add custom attributes. In the official example, SuperGlueConfig inherits from BuilderConfig and adds attributes such as label_classes:

class SuperGlueConfig(datasets.BuilderConfig):
    """BuilderConfig for SuperGLUE."""

    def __init__(self, features, data_url, citation, url, label_classes=("False", "True"), **kwargs):
        """BuilderConfig for SuperGLUE.

        Args:
            features: *list[string]*, list of the features that will appear in the
                feature dict. Should not include "label".
            data_url: *string*, url to download the zip file from.
            citation: *string*, citation for the data set.
            url: *string*, url for information about the data set.
            label_classes: *list[string]*, the list of classes for the label if the
                label is present as a string. Non-string labels will be cast to either
                'False' or 'True'.
            **kwargs: keyword arguments forwarded to super.
        """
        # Version history:
        # 1.0.2: Fixed non-nondeterminism in ReCoRD.
        # 1.0.1: Change from the pre-release trial version of SuperGLUE (v1.9) to
        #        the full release (v2.0).
        # 1.0.0: S3 (new shuffling, sharding and slicing mechanism).
        # 0.0.2: Initial version.
        super().__init__(version=datasets.Version("1.0.2"), **kwargs)
        self.features = features
        self.label_classes = label_classes
        self.data_url = data_url
        self.citation = citation
        self.url = url
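A builder can then register one such config per subset in its BUILDER_CONFIGS; a minimal sketch under assumed values (the name, features list, and URLs below are illustrative placeholders, not copied from the official script):

class SuperGlue(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIGS = [
        SuperGlueConfig(
            name="boolq",                      # placeholder subset name
            features=["question", "passage"],  # placeholder feature list
            data_url="https://<url>/BoolQ.zip",
            citation=_CITATION,                # assumed to be defined elsewhere in the script
            url="https://<url>/",
        ),
    ]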

3.3 Building the GeneratorBasedBuilder

GeneratorBasedBuilder is the class that downloads and processes the dataset. We subclass it and implement three key methods:

  • _info(): defines the dataset's metadata, including features, homepage, citation, etc.
  • _split_generators(): downloads the data and prepares the splits, e.g. train, validation, and test
  • _generate_examples(): processes the data, yielding examples in the desired form

1. The _info() method

This method defines the dataset's metadata, including features, homepage, citation, etc. features is a datasets.Features object, and each field's type can be declared with datasets.Value and similar classes. The method only needs to return an instantiated datasets.DatasetInfo object; here is an example:

def _info(self):
    return datasets.DatasetInfo(
        description=_DESCRIPTION,
        features=datasets.Features(
            {
                "id": datasets.Value("string"),
                # others
            }
        ),
        supervised_keys=None,
        homepage="https://<url>/",
        citation=_CITATION,
    )

2. The _split_generators method

This method downloads the data and prepares the splits (train, validation, test, etc.). It must return a list of datasets.SplitGenerator objects, where name is the split name and gen_kwargs is a dict (e.g. filepath, split) whose entries are passed as arguments to _generate_examples:

def _split_generators(self, dl_manager):
    """Returns SplitGenerators."""
    downloaded_file = dl_manager.download_and_extract("https://url/dataset.zip")
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN, gen_kwargs={"filepath": f"{downloaded_file}/dataset/training_data/"}
        ),
        datasets.SplitGenerator(
            name=datasets.Split.TEST, gen_kwargs={"filepath": f"{downloaded_file}/dataset/testing_data/"}
        ),
    ]

3. The _generate_examples method

This method does the actual example generation. It receives the arguments declared in gen_kwargs by _split_generators and yields examples in the desired form:

def _generate_examples(self, filepath):
    logger.info("⏳ Generating examples from = %s", filepath)
    ann_dir = os.path.join(filepath, "annotations")
    img_dir = os.path.join(filepath, "images")
    for guid, file in enumerate(sorted(os.listdir(ann_dir))):
        # process data code
        yield guid, {"id": str(guid), "tokens": tokens, "bboxes": bboxes}

3.4 Loading the dataset

Putting the pieces together, here is the complete FUNSD loading script; at the end we load it with load_dataset:

class Funsd(datasets.GeneratorBasedBuilder):
    """FUNSD dataset."""

    BUILDER_CONFIGS = [
        FunsdConfig(name="funsd", version=datasets.Version("1.0.0"), description="FUNSD dataset"),
    ]

    def _info(self):
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=datasets.Features(
                {
                    "id": datasets.Value("string"),
                    "tokens": datasets.Sequence(datasets.Value("string")),
                    "bboxes": datasets.Sequence(datasets.Sequence(datasets.Value("int64"))),
                    "ner_tags": datasets.Sequence(
                        datasets.features.ClassLabel(
                            names=["O", "B-HEADER", "I-HEADER", "B-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER"]
                        )
                    ),
                    "image": datasets.Array3D(shape=(3, 224, 224), dtype="uint8"),
                }
            ),
            supervised_keys=None,
            homepage="https://guillaumejaume.github.io/FUNSD/",
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""
        downloaded_file = dl_manager.download_and_extract("https://guillaumejaume.github.io/FUNSD/dataset.zip")
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"filepath": f"{downloaded_file}/dataset/training_data/"}
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST, gen_kwargs={"filepath": f"{downloaded_file}/dataset/testing_data/"}
            ),
        ]

    def _generate_examples(self, filepath):
        logger.info("⏳ Generating examples from = %s", filepath)
        ann_dir = os.path.join(filepath, "annotations")
        img_dir = os.path.join(filepath, "images")
        for guid, file in enumerate(sorted(os.listdir(ann_dir))):
            tokens = []
            bboxes = []
            ner_tags = []

            file_path = os.path.join(ann_dir, file)
            with open(file_path, "r", encoding="utf8") as f:
                data = json.load(f)
            image_path = os.path.join(img_dir, file)
            image_path = image_path.replace("json", "png")
            image, size = load_image(image_path)
            for item in data["form"]:
                words, label = item["words"], item["label"]
                words = [w for w in words if w["text"].strip() != ""]
                if len(words) == 0:
                    continue
                if label == "other":
                    for w in words:
                        tokens.append(w["text"])
                        ner_tags.append("O")
                        bboxes.append(normalize_bbox(w["box"], size))
                else:
                    tokens.append(words[0]["text"])
                    ner_tags.append("B-" + label.upper())
                    bboxes.append(normalize_bbox(words[0]["box"], size))
                    for w in words[1:]:
                        tokens.append(w["text"])
                        ner_tags.append("I-" + label.upper())
                        bboxes.append(normalize_bbox(w["box"], size))

            yield guid, {"id": str(guid), "tokens": tokens, "bboxes": bboxes, "ner_tags": ner_tags, "image": image}
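Once the script is complete, it can be loaded by passing its path to load_dataset just like a Hub dataset name (a sketch; saving it as funsd.py is our assumption):

from datasets import load_dataset

# "funsd.py" is the script above, saved in the current directory
dataset = load_dataset("funsd.py")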