Check out and get your data
This page is all about checking the structure of your data and decompressing it when needed.
Loading data as source
Before any execution, you need to preload the dataset into a BloArk instance (it could be any of `Builder`, `Reader`, or `Modifier`). This step only loads the file paths and does NOT load any actual warehouse into memory, so feel free to do it at any time; it will not overwhelm your machine.
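A minimal sketch of preloading (the constructor signature and the `./warehouses` path are assumptions; check the API reference for exact details):

```python
import bloark

# Every BloArk instance requires an output directory (see the notes below).
reader = bloark.Reader(output_dir='./output')

# Preloading only records file paths; nothing is decompressed or loaded yet.
reader.preload('./warehouses')
```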
There are a few things to note:
- You need to specify an output directory on any BloArk instance you create, including a `Reader` instance, because:
  - All reading operations require some level of decompression, so we need a place to store temporary files during the process.
  - You can decompress warehouses using a `Reader` instance, so we want you to specify the destination explicitly.
- You may optionally modify the `reader.files` variable when needed, for example to decompress only the first 10 warehouses. It is just a list of file path strings (see the sketch after these notes).
- The default `num_proc` on any BloArk instance is 1 for safety reasons. If you are running on a multi-core CPU machine, please increase the `num_proc` variable.
- Make sure you have enough disk space for the decompressed content. The compression ratio is about 1:100, so for a 100 MB warehouse, expect roughly 10 GB of extracted content.
- Make sure you have enough disk space for the temporary processing! This is important because when you assign multiple CPUs to the task (by setting `num_proc`), files will be decompressed in parallel. Please make sure you have `num_proc * single_decompressed_file_size` available for temporary usage. This matters most on CPU clusters (rather than personal computers), where permanent storage is limited but temporary storage is usually very large. For more information, contact your computing cluster administrator.
- Please do not allocate more processes than the CPUs you actually have, especially on large computing resources; doing so could get you soft-banned on CPU clusters. It is safe to proceed on your personal machine.
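As an illustration of the notes above (the slicing is hypothetical, and whether `num_proc` is set as an attribute or a constructor argument may vary; check the API reference):

```python
# Keep only the first 10 warehouses for a quick test run.
# reader.files is just a list of file path strings, so any list operation works.
reader.files = reader.files[:10]

# Use 4 worker processes; never exceed the number of CPUs you actually have.
reader.num_proc = 4
```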
Checking data structure
Sometimes, you might want to check the data structure of a dataset, because it might have been modified (we will cover the modification part later). It is worth knowing how to inspect the data structure of a dataset efficiently, so that you don't have to decompress warehouses and try to open a huge file on, for example, your personal device.
The method to check the data structure is also very simple:
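(A sketch using the `glimpse` function mentioned below, which returns both a sample entry and its structure; check the API reference for the exact signature.)

```python
# glimpse() decompresses one warehouse and returns a random block
# together with the inferred type structure of its keys.
sample, structure = reader.glimpse()
```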
This action will decompress one warehouse, so please make sure that your remaining storage space can hold at least one decompressed warehouse.
Sample entry
The sample entry (aka. random entry) should look like this:
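(The following is only a hypothetical illustration; the exact keys and values depend on the dataset.)

```python
# Hypothetical sample block; real keys and values depend on your dataset.
{
    "article_id": "12345",
    "revision_id": "67890",
    "timestamp": "2021-01-01T00:00:00Z",
    "text": "Full text of this revision checkpoint..."
}
```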
This is a random entry (aka. a block, in the Blocks Architecture definition) selected from the preloaded warehouses. We actively assume that all blocks in a dataset share the same data structure, and all data source maintainers are obligated to follow this assumption when creating a dataset.
An entry is a JSON object (dictionary), so you can use it just like a dict.
For example, if I want to get the text content of this block (usually a revision checkpoint), I can use the following code:
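(Assuming the content lives under a `text` key, as in the hypothetical sample above.)

```python
# Access the block like any Python dictionary.
text_content = sample['text']
print(text_content)
```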
Data structure
Sometimes, getting a sample entry alone might not be enough. You can also check the second variable returned by the `glimpse` function to help you better understand the block structure.
It usually looks like this:
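(Again hypothetical, matching the sample entry sketched above; the exact representation of the types may differ.)

```python
# The structure maps each key in the sample entry to its type.
{
    "article_id": "str",
    "revision_id": "str",
    "timestamp": "str",
    "text": "str"
}
```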
It simply tells you the type of each key in the sample entry.
Please note that it only checks the structure of the sample entry. If other blocks have a different structure, BloArk cannot detect it (due to resource limitations).
Decompress
This step is not required! It is here just because you might need it someday. You can decompress warehouses using our module very easily, with no need to write Python code yourself or rely on third-party software.
That said, you can definitely decompress with any tool that supports Zstandard when needed.
First, this could take a very long time to run, depending on warehouse sizes, the number of processes, and CPU performance. You cannot pause or resume once started; you can only stop the process and start all over again.
Second, please make sure that you have enough disk space before doing this, because the output might end up filling all available space on your computer.
Finally, consider loading only a small number of warehouses at a time to compensate for limited storage space.
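A minimal sketch of the full decompression flow (the `decompress()` method name is an assumption on our part; check the API reference for the exact call):

```python
import bloark

# Decompressed files will be written to the output directory.
reader = bloark.Reader(output_dir='./decompressed')
reader.preload('./warehouses')

# Optional: limit the batch and parallelism to fit your disk and CPU budget.
reader.files = reader.files[:5]
reader.num_proc = 4

# Assumed API: decompress all preloaded warehouses in parallel.
reader.decompress()
```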