Build from source
This page is for anyone who wants to build the Wiki Edit Blocks dataset from source (the Wikimedia Edit History dump files). It is not necessary unless you want to build something special.
In most cases you do not need to build from source, because the entire Wiki Edit Blocks dataset should be available online. However, if you want to save network bandwidth and/or use a different Wikimedia source dump, this page is for you.
Environment preparation
You will want a CPU cluster, because building from source can take more than 3 days with 8 processes (i.e. an 8-core CPU machine). Increasing the number of processes can shorten the run, but the speed-up is worse than linear and eventually hits the limit of disk I/O bandwidth.
On an 8-core machine, for example, you will need at least 2TB (preferably 3TB) of temporary storage space to process the dump safely. If you scale up the number of processes, this storage requirement grows linearly. When storage runs out, the process can get stuck and never finish, leaving broken data behind.
Consider decreasing the number of processes if you do not have enough storage to support parallel decompression.
Besides that, you will need the same Python environment suggested in Preparation (with bloark installed). In addition, you will need to install the following Python package from PyPI:
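For example, assuming the downloader is published on PyPI as wikidl (matching the WikiDL class referenced later on this page; substitute the actual package name if it differs):

```bash
pip install wikidl  # assumed PyPI name of the downloader (WikiDL) used below
```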
Download source dump
Use the following Python script to download the Wikipedia Edit History source dump:
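A minimal sketch of such a script, assuming the downloader exposes a WikiDL class that takes a process count and a snapshot date and writes the dump files to an output directory; the import path, constructor arguments, and method name here are assumptions, so check the downloader package's documentation for the exact interface:

```python
# Sketch of a download script (assumed API; adjust to the actual WikiDL interface).
from wikidl import WikiDL  # assumed import path

downloader = WikiDL(
    num_proc=3,               # at most 3 parallel downloads (see the note below)
    snapshot_date='20240101', # placeholder; pass a valid snapshot date
)
downloader.start(output_dir='./input')  # assumed method; ./input feeds the build step
```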
Setting num_proc to more than 3 is not recommended out of fairness to Wikimedia; use at most 3 processes to download in parallel.
A few things to keep in mind:
The number of processes should be at most 3: Wikimedia limits parallel connections to 3, and any additional connections are rejected with a 503 error.
You need to pass a valid snapshot date. Check here for all currently valid dates; dumps are updated roughly every half month.
Usually, the output directory of WikiDL is named ./input because the source will be used as input to the next step.
Then, simply running this Python script will download the source dump. If you want to make use of a CPU cluster, check out Run on Slurm cluster.
Start the building process
Then, use the following script to build:
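A minimal sketch of a build script, assuming bloark exposes a Builder that takes an output directory and a process count, is pointed at the downloaded dump files in ./input, and writes the resulting blocks to an assumed ./output directory; the class, argument, and method names below are assumptions, so consult the bloark documentation for the exact API:

```python
# Sketch of a build script (assumed bloark API; check the bloark docs for the real interface).
import bloark

# Assumed: output directory for the built blocks and the number of worker processes.
builder = bloark.Builder(output_dir='./output', num_proc=8)

# Assumed: point the builder at the downloaded dump files, then start building.
builder.preload('./input')
builder.build()
```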
It is very simple on your end; it just takes some time to run. If you are unsure about the result, try running it with only the first 10 files and/or fewer processes.
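For example, one way to do a quick trial run is to copy the first 10 dump files into a separate, hypothetical ./input_sample directory and point the builder at that directory with fewer processes:

```python
# Hypothetical smoke test: stage only the first 10 dump files so a trial build finishes quickly.
import glob
import shutil
from pathlib import Path

sample_dir = Path('./input_sample')
sample_dir.mkdir(exist_ok=True)
for path in sorted(glob.glob('./input/*'))[:10]:
    shutil.copy(path, sample_dir)
# Then run the build script above against ./input_sample with a smaller num_proc.
```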