Check out and get your data

This instruction is all about checking the structure of your data and decompressing it when needed.

Loading data as source

Before any execution, you need to preload the dataset into a BloArk instance (Builder, Reader, or Modifier). This step only records the file paths and does NOT load any actual warehouse into memory, so feel free to do it at any time; it will not strain your machine.

There are a few things to notice:

  • You need to specify an output directory on any BloArk instance you create, including a Reader instance. This is because:

    • All reading operations require some level of decompression, so we need a place to store temporary files during the process.

    • You can decompress warehouses using a Reader instance, so we want the output location to be specified explicitly.

  • You may optionally modify the reader.files variable when needed, for example to decompress only the first 10 warehouses. It is just a list of file-path strings.

  • The default num_proc on any BloArk instance is 1 for safety reasons. If you are running on a multi-core machine, increase the num_proc variable accordingly.

  • Make sure you have enough disk space for the decompressed content. The compression ratio is about 1:100, so for a 100 MB warehouse, expect roughly 10 GB of extracted content.

  • Make sure you have enough disk space for the temporary processing! This is important because when you assign multiple CPUs to this task (by setting num_proc), files are decompressed in parallel. Please make sure that you have num_proc * single_decompressed_file_size available for temporary usage (see the sketch after this list).

    • This matters most on CPU clusters (rather than your personal computer), where permanent storage is limited but temporary storage is usually very large. For more information, contact your computation cluster manager.

  • Please do not allocate more processes than the CPUs you actually have, especially on large computing resources. This could land you in a soft "jail" on CPU clusters. It is safe to proceed on your personal machine.
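
As a rough illustration of the temporary-space rule above, here is a small back-of-the-envelope sketch. The numbers are hypothetical; plug in the sizes of your own warehouses.

# Hypothetical estimate of temporary disk usage during parallel decompression.
# Assumption: each warehouse decompresses to roughly 100x its compressed size
# (the ~1:100 ratio mentioned above), and num_proc files are in flight at once.
num_proc = 8                        # processes you plan to use
compressed_warehouse_gb = 0.1       # ~100 MB per compressed warehouse
decompressed_warehouse_gb = compressed_warehouse_gb * 100  # ~10 GB each

temp_space_needed_gb = num_proc * decompressed_warehouse_gb
print(f"Plan for at least {temp_space_needed_gb:.0f} GB of temporary space.")
# -> Plan for at least 80 GB of temporary space.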

from bloark import Reader

# TODO: Increase `num_proc` according to what you need and what your machine has.
reader = Reader(num_proc=1, output_dir='./output')
reader.preload('./input')

# NOTE: You may optionally do this if you want to manually filter the files.
#  For example, to decompress only the first 10 warehouses.
#  You can even set the file list manually.
#  This step is not required and not recommended for production runs.
reader.files = reader.files[:10]

Checking data structure

Sometimes, you might want to check the data structure of a dataset because it might have been modified (we will cover the modification part later). It is worth knowing how to inspect the data structure of a dataset efficiently so that you don't have to decompress warehouses and try to open a huge file on, for example, your personal device.

The method for checking the data structure is also very simple:

sample_entry, data_structure = reader.glimpse()

This action will decompress one warehouse, so please make sure that your remaining storage space can hold at least one decompressed warehouse.
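
If you want a quick look at both return values, one simple option (assuming both are plain dictionaries, as the examples below show) is to pretty-print them:

import json

# Pretty-print the sample entry and the inferred structure for a quick look.
print(json.dumps(sample_entry, indent=2))
print(json.dumps(data_structure, indent=2))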

Sample entry

The sample entry (a randomly selected entry) should look like this:

{
  "article_id": "12992",
  "revision_id": "548247",
  "timestamp": "2002-02-25T15:51:15Z",
  "contributor": {
    "username": "Conversion script",
    "id": "1226483"
  },
  "minor": null,
  "comment": "Automated conversion",
  "model": "wikitext",
  "format": "text/x-wiki",
  "text": {
    "@bytes": "897",
    "@xml:space": "preserve",
    "#text": "The <b>go-fast boat</b> is the [[smuggling]] vessel of choice in many parts of the world in the [[1990s]] and [[2000s|first years]] of the [[21st century]].  Built of solid [[fiberglass]], wide of beam and with a deep \"<tt>V</tt>\" hull form, the typical go-fast carries a ton or more of cargo, several fuel drums, a handheld [[global positioning system]], perhaps a [[cellular telephone]], and a small crew.  With 250-plus horsepower engine, they travel at top speeds of 35-50 knots, slowing little in light chop and still maintaining 25 or more knots in the average five- to seven-foot Caribbean seas.  They are heavy enough to cut through higher waves, although at a slower pace.  With no metallic fittings, go-fasts are rarely detected by [[radar]] except in a flat calm or at close range.\n\nThe [[US Coast Guard]] finds them to be stealthy, fast, seaworthy, and very difficult to intercept."
  },
  "sha1": "8tpqw1jvrnynmtmsypubujx0o3483sw"
}

This is a random entry (called a block in the Blocks Architecture definition) selected from the preloaded warehouses. We assume that all blocks in a dataset share the same data structure. All data source maintainers should abide by this assumption when creating a dataset.

An entry is a JSON object (dictionary), so you can use it just like a regular dict.

For example, if I want to get the text content of this block (usually a revision checkpoint), I can use the following code:

sample_entry['text']['#text']
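
Other fields work the same way: they are just nested dictionary lookups. The keys below come from the sample entry shown above; if your dataset has a different structure, adjust accordingly.

# Plain dictionary access, based on the sample entry above.
article_id = sample_entry['article_id']
timestamp = sample_entry['timestamp']
username = sample_entry['contributor']['username']

# Use .get() when a key might be absent in your dataset.
comment = sample_entry.get('comment', '')

print(article_id, timestamp, username, comment)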

Data structure

Sometimes, looking at a raw sample entry alone might still be confusing. You can also check the second variable returned by the glimpse function to help you better understand the block structure.

It usually looks like this:

{
  "article_id": "str",
  "revision_id": "str",
  "timestamp": "str",
  "contributor": {
    "username": "str",
    "id": "str"
  },
  "minor": "NoneType",
  "comment": "str",
  "model": "str",
  "format": "str",
  "text": {
    "@bytes": "str",
    "@xml:space": "str",
    "#text": "str"
  },
  "sha1": "str"
}

It simply tells you the type of each key's value in the sample entry.

Please note that it only derives the structure from the sample entry. If other blocks have a different structure, BloArk cannot detect it (due to resource limitations).
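
For intuition, the structure report is essentially a type-name mirror of the entry. The sketch below shows how such a mapping could be derived from any entry; it is an illustration, not BloArk's internal implementation, and it only handles nested dictionaries.

def describe_structure(value):
    """Map a nested dict to the type names of its leaf values."""
    if isinstance(value, dict):
        return {key: describe_structure(val) for key, val in value.items()}
    return type(value).__name__

# describe_structure(sample_entry) would yield a report like the one above,
# e.g. {"article_id": "str", ..., "minor": "NoneType", ...}.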

Decompress

This step is not required! It is here just because you might need it some day. You can decompress warehouses very easily using our module, so there is no need to write Python code yourself or decompress with third-party software.

That said, you can definitely decompress with any tool that supports Zstandard when needed.

reader.decompress()

First, this could take a very long time to run, depending on warehouse sizes, the number of processes, and CPU performance. You cannot pause or resume once it has started; you can only stop the process and start all over again.

Second, please make sure that you have enough disk space before doing this, because it might end up filling all available space on your computer.

Finally, consider loading only a small number of warehouses at a time to compensate for limited storage space.
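
One way to do that, assuming reader.decompress() operates on whatever is currently listed in reader.files (as the filtering example above suggests; whether repeated calls on one Reader behave this way is an assumption), is to slice the file list into small batches. The batch size below is arbitrary.

from bloark import Reader

# Hypothetical batched run: decompress a handful of warehouses at a time
# so the decompressed output never outgrows your available disk space.
reader = Reader(num_proc=4, output_dir='./output')
reader.preload('./input')

all_files = list(reader.files)
batch_size = 5  # arbitrary; pick what your storage can hold

for start in range(0, len(all_files), batch_size):
    reader.files = all_files[start:start + batch_size]
    reader.decompress()
    # Move or clean up the decompressed output here before the next batch.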
