Build from source

These instructions are for building the Wiki Edit Blocks dataset from source (Wikimedia Edit History dump CSV files). This is not necessary unless you need something special.

In most cases, you don't need to build from source because the entire Wiki Edit Blocks dataset should be available online. However, if you want to save network bandwidth and/or use a different Wiki source dump, these instructions are for you.

Environment preparation

  • You definitely need a CPU cluster machine: building from source can take more than 3 days on 8 processes (i.e. an 8-core CPU machine). Scaling up the number of processes can shorten the run, but the speedup is worse than linear and eventually hits the limit of disk I/O bandwidth.

  • On an 8-core CPU machine, for example, you will need at least 2 TB (preferably 3 TB) of temporary storage to process the dump safely. If you scale up the number of processes, this storage requirement grows linearly. When storage runs out, the process may get stuck and never finish, leaving broken data behind.

  • Consider decreasing the number of processes if you don't have enough storage to support parallel decompression. A quick free-space check is sketched right after this list.
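
For example, here is a minimal pre-flight check using only the standard library; the 3 TB threshold is just an illustrative assumption based on the numbers above, not an exact requirement:

import shutil

# Hypothetical pre-flight check: confirm the temporary storage location has
# enough free space before starting a multi-day build.
REQUIRED_TB = 3  # assumption: roughly 3 TB for 8 processes, as noted above

free_tb = shutil.disk_usage('./').free / 1024**4
if free_tb < REQUIRED_TB:
    print(f'Only {free_tb:.1f} TB free; consider reducing num_proc or freeing space.')
else:
    print(f'{free_tb:.1f} TB free; this should be enough.')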

Besides that, you will need the same Python environment suggested in Preparation (with bloark installed). On top of that, install the following Python package from PyPI:

wikidl==0.1.0
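
If you manage packages with pip, this corresponds to running pip install wikidl==0.1.0 in your environment.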

Download source dump

Use the following Python script to download the Wikipedia Edit History dump source:

from wikidl import WikiDL
import logging

downloader = WikiDL(
    num_proc=3,
    snapshot_date='20240201',  # You can update this to the latest dump date
    select_pattern='ehd',  # This means: Edit History Dump (EHD)
    log_level=logging.DEBUG,  # Use DEBUG to show the progress in detail
)
downloaded_files = downloader.start(output_dir='./input')

Setting num_proc to more than 3 is not recommended out of fairness to Wikimedia; use at most 3 processes to download in parallel.

A few things to keep in mind:

  • Keep the number of processes at 3 or fewer. Wikimedia limits parallel connections to 3, and additional connections will be rejected with a 503 error.

  • You need to pass a valid snapshot date. Check here for the currently valid dates; dumps are updated roughly every half month. A small helper for formatting the date string is sketched after this list.

  • Usually, the output directory of WikiDL is named ./input because its contents serve as the input to the next step.
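
As a convenience, the sketch below formats a snapshot date string in the expected YYYYMMDD form. It assumes the roughly twice-a-month schedule mentioned above (around the 1st and the 20th), so always verify the result against the dump index before using it:

from datetime import date

def latest_snapshot_date(today=None):
    # Assumption: dumps are published about twice a month, around the 1st and
    # the 20th. The newest run may still be incomplete, so verify the returned
    # date against the dump index before passing it to WikiDL.
    today = today or date.today()
    day = 20 if today.day >= 20 else 1
    return date(today.year, today.month, day).strftime('%Y%m%d')

print(latest_snapshot_date())  # e.g. '20240201'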

Then, simply running this Python script will download the source dump. If you want to make use of CPU cluster resources, check out Run on Slurm cluster.
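
If you want to sanity-check the download before building, here is a minimal sketch using only the standard library; it assumes the dump files end up under ./input:

from pathlib import Path

# Hypothetical sanity check: list the downloaded dump files and their total size.
input_dir = Path('./input')
files = sorted(p for p in input_dir.rglob('*') if p.is_file())

total_gb = sum(p.stat().st_size for p in files) / 1024**3
print(f'{len(files)} files downloaded, {total_gb:.1f} GB in total')
for p in files[:5]:
    print(p.name)  # peek at a few file names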

Start building process

Then, use the following script to build:

import logging
import bloark


if __name__ == '__main__':
    # Create a builder instance with 8 processes and INFO-level logging.
    builder = bloark.Builder(output_dir='./output', num_proc=8, log_level=logging.INFO)

    # Preload all files from the input directory (original data sources).
    # This call should be nearly instant because it only collects file paths
    # rather than loading the files themselves.
    builder.preload('./input')

    # For testing purposes, you can build only the first 10 files.
    # Modifying builder.files like this works, but is not recommended in production.
    # builder.files = builder.files[:10]

    # Start building the warehouses (this command will take a long time).
    builder.build()

The process is simple on your end; it just takes time to run. If you are unsure about the result, try a trial run with only the first 10 files and/or fewer processes, as in the sketch below.
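
A trial run reuses the same calls as the full script; ./output_test is an arbitrary separate output directory so that test results don't mix with the real build:

import logging
import bloark


if __name__ == '__main__':
    # Smaller trial run: fewer processes, a separate output directory, and only
    # the first 10 input files.
    builder = bloark.Builder(output_dir='./output_test', num_proc=2, log_level=logging.DEBUG)
    builder.preload('./input')
    builder.files = builder.files[:10]  # test only; not recommended in production
    builder.build()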
