Download dataset from HF

This page describes how to download the dataset from HuggingFace (HF) efficiently.

The HugDown package is still experimental, so you may run into issues (e.g. a parallel connection limit) while using it. We maintain this package in-house, so please feel free to report problems by emailing research@lingxi.li.

In most cases, we will upload the processed Wiki Edit Blocks onto HuggingFace. To download files from HuggingFace, you can choose any of the methods below:

  • Manually download files using the HuggingFace website, the HuggingFace CLI, or direct file paths.

  • Download using the HuggingFace package (I have not tried this yet; see the sketch after this list).

  • Download using HugDown (our in-house solution).
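
If you prefer the HuggingFace package route (again, untested here), below is a minimal sketch using huggingface_hub.snapshot_download. The repository ID and file pattern mirror the HugDown example later on this page; treat them as placeholders for the actual release:

from huggingface_hub import snapshot_download

# Download only the compressed JSONL shards of the dataset repository.
snapshot_download(
    repo_id='lilingxi01/wiki-edit-blocks-test1',
    repo_type='dataset',
    allow_patterns='*.jsonl.zst',
    local_dir='./downloaded'
)

The allow_patterns filter keeps the download limited to the compressed JSONL shards.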

There is no restriction on how you obtain the source dataset. You can even ask someone who already has it to copy it onto a hard disk and ship it physically.

Dataset size expectations

The Wiki Edit Blocks dataset is estimated at 320 GB to 400 GB (as of Feb 2024). This is already a compressed size and may grow in the future.

For comparison, the source (the Wikimedia CSV dump) is only around 250 GB. Our processed dataset is larger because:

  • We have re-compressed the data using a faster algorithm so that it can be read more quickly in production. The drawback is a lower compression ratio. If the size becomes unmanageable in the future, we may switch to a slower algorithm with a higher compression ratio. A sketch of reading the compressed shards follows this list.

  • We have stored extra metadata for better indexing and querying performance.
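
The *.jsonl.zst extension on the released shards indicates Zstandard-compressed JSON Lines. As a minimal sketch (using the third-party zstandard package and a placeholder shard name, neither of which is part of the official tooling), a downloaded shard can be streamed line by line without decompressing it to disk first:

import io
import json
import zstandard as zstd

# Stream-decode one downloaded shard (placeholder file name) line by line.
with open('./downloaded/shard-00000.jsonl.zst', 'rb') as fh:
    reader = zstd.ZstdDecompressor().stream_reader(fh)
    for line in io.TextIOWrapper(reader, encoding='utf-8'):
        record = json.loads(line)
        # Each line is presumably one edit-block record; process as needed.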

Download using HugDown

First, you need to set up a proper Python environment as described in Preparation. Then install the following package:

hugdown==0.2.2
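
To confirm that the pinned version ended up in the active environment, here is a quick sanity check using only the Python standard library (nothing HugDown-specific is assumed):

import importlib.metadata

# Should print 0.2.2 if the pinned version was installed correctly.
print(importlib.metadata.version('hugdown'))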

After that, you can use the following Python script to download the HuggingFace dataset:

from hugdown import HugDown

# Create a downloader instance with the desired number of parallel download processes.
downloader = HugDown(num_proc=4)

# Preload files based on HuggingFace repository and file extensions.
downloader.preload_files(
    repo='lilingxi01/wiki-edit-blocks-test1',
    data_files='*.jsonl.zst'
)

# Optional: filter the preloaded file list to suit your needs (here, keep only the first 8 files for testing).
downloader.files = downloader.files[:8]

# Assign an output directory and start the download.
downloader.start(output_dir='./downloaded')

If you want to leverage the CPU cluster for downloading, check out Run on Slurm cluster.

Currently, I have not observed any concurrency limit on the HuggingFace server side. Still, please do not use too many parallel processes: it hurts resource fairness and may land you in a soft rate-limit "jail". I have tried 4 processes and it works fine for short-term usage.
