Modify dataset for your needs

This guide explains how to further modify the dataset to fit your needs. You can think of this as the data preprocessing step in NLP.

Goal of this instruction

As an example, this guide focuses on a simple task: extracting all links from the Wiki Edit Blocks dataset. During preprocessing, we need to go over all revisions, extract the links from each revision's text, and store them again in a revision-based structure.

Define a modifier

To do that, you first need to define a modifier. A modifier contains the logic for how we want to change the data stored in a block. For the task described above, the modifier definition should look like this:

import logging
import bloark
from grimm import clean_syntax  # Need to install separately.


# Define a modifier profile.
class MyModifier(bloark.ModifierProfile):
    count: int = 0

    def __init__(self):
        self.count = 0

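    def block(self, content: dict, metadata: dict):
        # Count the blocks (revisions) seen by this profile instance.
        self.count += 1

        # Read the revision text and let Grimm split it into clean text and links.
        text_content = content['text']['#text']
        text, external_links, internal_links, images = clean_syntax(text_content)

        # Replace the stored block data with the extracted fields.
        new_content = {
            "clean_text": text,
            "external_links": external_links,
            "internal_links": internal_links,
            "images": images,
        }
        # Return the metadata unchanged so the segment is kept.
        return new_content, metadata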

For the above example, I used a package called Grimm to clean out syntax and extract links. It is not installed as part of our preparation section. If you also want to use it, please check out Grimm's documentation and install it accordingly. Grimm is another package by me, so feel free to use it!
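If you would rather not install Grimm while experimenting, a rough stand-in can be sketched with regular expressions. The helper below, extract_links, is a hypothetical name and is not part of Grimm or BloArk. It only handles the most common wikitext link forms and does not clean the text or separate out images, so treat it as a starting point rather than a replacement for clean_syntax.

```python
import re

# Simplified patterns (assumptions): real wikitext has many more link forms.
# [[Target]], [[Target|label]] and [[Target#Section]] are treated as internal links.
INTERNAL_LINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")
# [https://example.org optional label] is treated as an external link.
EXTERNAL_LINK = re.compile(r"\[(https?://[^\s\]]+)(?:\s+[^\]]*)?\]")


def extract_links(wikitext: str):
    """Return (internal_links, external_links) found in a wikitext string."""
    internal = [m.group(1).strip() for m in INTERNAL_LINK.finditer(wikitext)]
    external = [m.group(1) for m in EXTERNAL_LINK.finditer(wikitext)]
    return internal, external


sample = "See [[Python (programming language)|Python]] and [https://example.org the site]."
internal_links, external_links = extract_links(sample)
print(internal_links, external_links)
```

Note that this sketch also reports file and category links (for example [[File:...]]) as internal links, because it does not inspect namespaces.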

Here are a few things that might be helpful for you:

  • For each segment (i.e., an article containing many revisions), BloArk creates a new Modifier instance. That means all class-level variables, such as count in this case, are shared across the different revisions of that article.

  • The block function is applied to each individual block and should return a tuple. The first element of the tuple should be a dict of block data (at the revision level). The second element should be a dict of segment metadata (at the article level).

  • Metadata is shared across the entire article (segment). The metadata you return from one execution of the block function is passed as the metadata argument to the next execution (on the next revision).

  • Content is just a dict that contains the data stored in each block. This is what gets stored in the JSONL files.

  • If you want to remove data...

    • If you want to remove a block instead of updating its stored data, simply return None in the content position, and it will not be stored in the new warehouses.

    • If you return None in the metadata position, the entire segment will be removed (not stored in the new warehouses).

  • You must return metadata from the block function even if you do not make any changes to it, unless you want to delete the entire segment!

All class-level variables are garbage collected only at the end of processing an article. Be mindful of the memory limit if you want to store a large amount of data in class-level variables.
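The return-value rules above can be tried out without BloArk installed. The following is a minimal sketch: KeepLongRevisions is a plain class that follows the block(content, metadata) contract described above (it does not subclass bloark.ModifierProfile), and run_segment is a hypothetical driver that only imitates the documented behavior for a single segment. It is not BloArk's actual implementation. The length threshold and the redirect check are made-up rules for illustration.

```python
class KeepLongRevisions:
    """Illustrative profile following the documented block() contract."""

    MIN_CHARS = 20  # hypothetical threshold, chosen only for this example

    def block(self, content: dict, metadata: dict):
        text = content.get("text", {}).get("#text", "")
        if "#REDIRECT" in text:
            # None in the metadata position: drop the entire segment.
            return None, None
        if len(text) < self.MIN_CHARS:
            # None in the content position: drop this block, keep the segment.
            return None, metadata
        # Carry a counter in the metadata so the next revision can see it.
        metadata = dict(metadata, kept=metadata.get("kept", 0) + 1)
        return {"length": len(text)}, metadata


def run_segment(profile, blocks, metadata):
    """Hypothetical driver imitating the documented per-segment behavior."""
    kept = []
    for content in blocks:
        # The metadata returned by one call is fed into the next call.
        new_content, metadata = profile.block(content, metadata)
        if metadata is None:
            return None, None  # the whole segment is removed
        if new_content is not None:
            kept.append(new_content)
    return kept, metadata


blocks = [
    {"text": {"#text": "short"}},
    {"text": {"#text": "A revision with enough text to keep."}},
    {"text": {"#text": "Another sufficiently long revision body."}},
]
# A fresh profile instance per segment, as described above.
kept, meta = run_segment(KeepLongRevisions(), blocks, {"title": "Example"})
print(len(kept), meta)
```

Here the first block is dropped, the other two are kept, and the kept counter travels through the metadata from one revision to the next.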

Start a modification process

Once you have defined a modifier, you can use the following script to run it. You need to configure the input directory, output directory, and number of processes accordingly. To run it efficiently on a CPU cluster, check out Run on Slurm cluster.

if __name__ == '__main__':
    # Create a modifier instance with 8 processes (CPUs) and INFO-level logging.
    modifier = bloark.Modifier(
        output_dir='./output',
        num_proc=8,  # TODO: Adjust accordingly!
        log_level=logging.INFO
    )

    # Preload all files from the input directory (original warehouses).
    modifier.preload('./warehouses')

    # Add the modifier profile to the modifier instance.
    # You can even add multiple profiles to the same modifier.
    modifier.add_profile(MyModifier())

    # Start modifying the warehouses (this command will take a long time).
    modifier.start()

Modification can take a long time to run! Please plan the usage of your machine ahead of time.

Once the script finishes, the entire modification process is complete!
