Modify dataset for your needs
This guide explains how to further modify the dataset to fit your needs. You can think of this as the data preprocessing step in NLP.
Goal of this guide
As an example, this guide focuses on a simple task: extracting all links from the Wiki Edit Blocks dataset. In preprocessing, we need to go over all revisions, extract the links from each revision's text, and re-store them in a revision-based structure.
Define a modifier
To do that, you need to define a modifier first. A modifier contains the logic for how we want to change the data stored in a block. For the task described above, the modifier definition should look like the following:
In the above example, I used a package called Grimm to clean up the syntax and extract links. It is not installed in the preparation section, so if you also want to use it, please check Grimm's documentation and install it accordingly. Grimm is another package of mine, so feel free to use it!
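If you would rather not install an extra package, link extraction can be approximated with a regular expression. The sketch below is not Grimm and does not reproduce its behavior. It assumes links are written in the internal wikitext form `[[Target]]` or `[[Target|label]]`, and the names `WIKILINK_RE` and `extract_wikilinks` are made up for illustration.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
# Group 1 captures only the target page title.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_wikilinks(text: str) -> list:
    """Return the target titles of internal links, in order of appearance."""
    return [match.group(1).strip() for match in WIKILINK_RE.finditer(text)]


sample = "See [[Python (programming language)|Python]] and [[Wiki#History]]."
print(extract_wikilinks(sample))
```

This simple pattern treats file and category links like any other internal link and ignores single-bracket external links, so a dedicated parser is the better choice for anything beyond a quick experiment.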
Here are a few things that might be helpful for you:
For each segment (i.e., an article containing many revisions), BloArk creates a new Modifier instance. That means all class-level variables, such as `count` in this case, are shared across the revisions of that segment.
The block function is applied to each individual block and should return a tuple. The first element of the tuple should be a dict of block data (at revision level). The second element should be a dict of segment metadata (at article level).
Metadata is shared across the entire article (segment). The metadata returned by the previous execution of the block function is passed as the metadata argument to the next execution (on the next revision).
Content is just a dict containing the data stored in each block. This is what gets stored in the JSONL files.
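The points above can be illustrated with a small stand-alone simulation. This is only a sketch of the contract as described here, not BloArk code: `CountingModifierSketch`, `run_segment`, and the keys `position_in_article` and `revisions_seen` are hypothetical names, and the driver loop is an assumption about how the block function is called.

```python
from typing import Optional, Tuple


class CountingModifierSketch:
    """Stand-in for a modifier (NOT a BloArk class)."""

    # Class-level default; becomes per-instance state once assigned via self.
    count = 0

    def block(self, content: dict, metadata: dict) -> Tuple[Optional[dict], Optional[dict]]:
        self.count += 1
        # First element: block data (revision level).
        new_content = {**content, "position_in_article": self.count}
        # Second element: segment metadata (article level).
        new_metadata = {**metadata, "revisions_seen": self.count}
        return new_content, new_metadata


def run_segment(revisions: list, metadata: dict) -> Tuple[list, dict]:
    """Hypothetical driver: one fresh modifier per segment."""
    modifier = CountingModifierSketch()
    stored = []
    for content in revisions:
        # The metadata returned here is fed into the next call.
        new_content, metadata = modifier.block(content, metadata)
        stored.append(new_content)
    return stored, metadata


blocks, meta = run_segment([{"text": "first"}, {"text": "second"}], {"title": "Example"})
print(blocks)
print(meta)
```

Because a new instance is created for every segment, `count` starts from zero again for the next article, while within one article it keeps growing across revisions.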
If you want to remove data...
If you want to remove a block instead of updating the stored data, simply return `None` in the content position; the block will then not be re-stored into the new warehouses.
If you return `None` in the metadata position, the entire segment will be removed (not re-stored into the new warehouses).
You must return metadata from the block function even if you do not make any changes to it, unless you want to delete the entire segment!
Class-level variables are garbage collected only once an article has been fully processed. Keep the memory limit in mind if you want to store a large amount of data in class-level variables.
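The removal rules can be sketched the same way. The code below is a stand-alone illustration, not BloArk's implementation: the `text` key, both example block functions, and `restore_segment` are hypothetical, and stopping at the first `None` metadata is an assumption about how segment removal behaves.

```python
from typing import Callable, List, Optional, Tuple

BlockResult = Tuple[Optional[dict], Optional[dict]]


def drop_empty_revisions(content: dict, metadata: dict) -> BlockResult:
    """Remove revisions without text, but keep the segment."""
    if not content.get("text"):
        return None, metadata  # None content: this block is not re-stored
    return content, metadata  # metadata is returned even when unchanged


def drop_redirect_articles(content: dict, metadata: dict) -> BlockResult:
    """Remove the whole segment when a revision is a redirect."""
    if content.get("text", "").upper().startswith("#REDIRECT"):
        return None, None  # None metadata: the entire segment is removed
    return content, metadata


def restore_segment(
    block_fn: Callable[[dict, dict], BlockResult],
    revisions: List[dict],
    metadata: dict,
) -> Optional[Tuple[List[dict], dict]]:
    """Hypothetical driver that interprets the returned tuple."""
    kept = []
    for content in revisions:
        new_content, metadata = block_fn(content, metadata)
        if metadata is None:
            return None  # segment removed, nothing is re-stored
        if new_content is not None:
            kept.append(new_content)
    return kept, metadata


revisions = [{"text": "a"}, {"text": ""}, {"text": "b"}]
print(restore_segment(drop_empty_revisions, revisions, {"title": "T"}))
print(restore_segment(drop_redirect_articles, [{"text": "#REDIRECT [[X]]"}], {"title": "T"}))
```

Note how `drop_empty_revisions` passes the metadata through untouched; forgetting to do so would silently delete the segment.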
Start a modification process
Once you have defined a modifier, you can use the following script to start running it. You need to configure the input dir, output dir, and number of processes accordingly. To run it efficiently on a CPU cluster, check out Run on Slurm cluster.
Modification can take a long time to run! Please plan your machine usage ahead of time.
After that, the entire modification process is done!