Encoding Datasets Into PNGs

Gridded meteorological datasets can be quite large. It's easy to find data grids that are over 1000 rows by 1000 columns, and the resulting files become a real problem when you have to work with a large quantity of them at once. If you don't have the disc space available, how are you going to make use of all this data? If a high degree of precision is not important, there is a surprisingly efficient way to shrink these datasets, especially if they contain a lot of blank or null entries. The solution borrows the lossless compression built into the Portable Network Graphics (PNG) image format.

All you need to do is encode each value into a red-green-blue (RGB) triplet and save the result as a PNG file. Whatever encoding scheme you use, make sure there is plenty of separation between the allowed red, green, and blue values. Otherwise, small perturbations to the pixel values can cause data to be lost or decoded incorrectly.

Probably the safest option is to use red, green, and blue values in steps of 16 or 32, depending on the desired precision. Since the highest value for red, green, or blue is 255, a separation of 32 gives eight allowed levels per channel; the triplet then acts as three base-8 digits, which can store a maximum value of 511. This may not sound like a large number, but many physical variables carry no meaningful precision past the tenths place, so anything bounded by 51.1 units can be stored using a separation value of 32. A separation value of 16 gives sixteen levels per channel (three base-16 digits), raising the maximum possible value to 4095. For example, under a separation of 16 the triplet (16, 48, 160) represents the digits (1, 3, 10), which decodes to 256 × 1 + 16 × 3 + 10 = 314. Lower separation values potentially yield greater precision, but also carry a greater likelihood of data corruption.

You could also reserve a specific RGB triplet to serve as a "null value" (i.e. something that denotes missing data). Probably the best triplet for this is 255 for red, 255 for green, and 255 for blue, which echoes the long-standing computing convention of reserving 0xff (255) for missing data. Conveniently, under a separation of 16 every valid channel value is a multiple of 16 (at most 240), so (255, 255, 255) can never collide with real data.

Here is a Python function showing one strategy for encoding data using a separation value of 16:

def encode(x, factor, offset):
    # Scale and shift the value into an integer between 0 and 4095
    z = int(round(factor * (x - offset)))
    # Guard against values below the offset, which would break the digit math
    z = max(z, 0)

    R = 0
    G = 0
    B = 0

    # Extract the base-16 digits, from most significant to least significant
    if (z >= 256):  # 16 squared
        # Using a bit shift since it is an efficient way to divide by 256
        R = z >> 8
        z = z - R * 256
    if (z >= 16):
        # Using a bit shift since it is an efficient way to divide by 16
        G = z >> 4
        z = z - G * 16
    if (z >= 1):
        B = z
        z = z - B

    # Shift each digit left by 4 bits (multiply by 16) to create the desired
    # separation of 16, and cap anything greater than 255 at 255
    R = min(R << 4, 0xff)
    G = min(G << 4, 0xff)
    B = min(B << 4, 0xff)

    return (R, G, B)


example = encode(3.14, 100, 0)
print(example)   # (16, 48, 160)


And to decode the data from the RGB triplet:

def decode(RGB, factor, offset):
    # Undo the separation: shift each channel right by 4 bits to recover
    # the original base-16 digits
    R = RGB[0] >> 4
    G = RGB[1] >> 4
    B = RGB[2] >> 4

    # Reassemble the integer, then undo the scaling and shifting
    z = offset + ((256 * R + 16 * G + B) / factor)

    return z


exampleDecoded = decode(example, 100, 0)
print(exampleDecoded)   # 3.14


In the above, the variable "factor" scales the data so that the encoded integers span as much of the available range as possible. For example, if you're encoding something with values ranging from 0 to 15, a sensible value for "factor" would be about 270 (270 × 15 = 4050, which is close to the maximum of 4095 from above).

The variable "offset" represents a value that will be subtracted to ensure that the encoded values start from a number that is approximately 0. For example, if you're encoding something that has values ranging from 271 to 306, then the offset should be 271. Note that both "factor" and "offset" are values that you'll have to know beforehand when converting the PNG file back into the actual data.

Once the encoding scheme has been devised, it becomes a matter of cycling through each grid point and using an imaging library to create the PNG file from an array. To recover the data, simply invert the process: use an imaging library to load the PNG file's RGB triplets and convert each pixel back to its original value.
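One possible round trip, sketched with the Pillow and NumPy libraries (the function names grid_to_png and png_to_grid are my own), building on the encode and decode functions above:

import numpy as np
from PIL import Image

def grid_to_png(grid, factor, offset, path):
    # Encode a 2-D array into a PNG, one RGB pixel per grid point
    rows, cols = grid.shape
    rgb = np.zeros((rows, cols, 3), dtype=np.uint8)
    for i in range(rows):
        for j in range(cols):
            rgb[i, j] = encode(grid[i, j], factor, offset)
    Image.fromarray(rgb, mode="RGB").save(path)

def png_to_grid(path, factor, offset):
    # Invert the process: read each pixel and decode it back to a value
    rgb = np.asarray(Image.open(path).convert("RGB"))
    grid = np.zeros(rgb.shape[:2])
    for i in range(grid.shape[0]):
        for j in range(grid.shape[1]):
            # Cast to plain ints to avoid 8-bit overflow while decoding
            grid[i, j] = decode([int(v) for v in rgb[i, j]], factor, offset)
    return grid

PNG's built-in DEFLATE compression does the rest of the work; rows full of identical null pixels compress down to almost nothing.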

How effective is this? Consider a 256 × 256 grid of rainfall values where the totals might range from 0 inches to 10 inches. Using a "factor" of 400 and an "offset" of 0, it is possible to encode a PNG file that consumes anywhere from 5 KB to 15 KB of data with a quantization step of 0.0025 units (inches in this case), i.e. a worst-case rounding error of about ±0.001 inches. If you used a straight one-to-one binary file and reserved 2 bytes of data for each grid point, that would amount to 131,072 bytes (2 × 256 × 256, about 131 KB); spreading those 16 bits across the same 0 to 10 inch range would give a step of roughly 0.00015 inches. The PNG's 12 bits per grid point are therefore 16 times coarser than a full 2-byte encoding, but the file size shrinks by a factor of 10 to 25 (depending on how much redundancy there is in the grid). Whether or not that's an acceptable trade-off depends on what you're doing, but it's still an impressive way to conserve disc space.

I also tried this with a Network Common Data Form (NetCDF) file containing sea-surface temperature (SST) data. The grid is 24 × 170 × 180 (24 values for time, 170 for latitude, and 180 for longitude). Since this is SST data, a lot of grid points are blank because of land masses. And, since water is very resistant to changes in temperature, the range of values is relatively narrow (from about 271 K to 306 K, a range of about 35 K). Encoding the data into 24 PNG files (one for each time value) amounted to 688 KB with a mean error of 0.0025 K (0.0045 °F). For perspective, the original NetCDF file was 2.9 MB, and the observational error of most temperature sensors is about 0.1 K (0.18 °F). Zipping all of the PNG files into a single archive reduced the size further to 664 KB with no additional data loss. The end result is a reduction in disc space by a factor of more than 4 with a nearly negligible increase in imprecision.

If the results seem too good to be true, here is a link to the aforementioned data file and download links for the Python programs so that you can verify the results for yourself:
NetCDF File (note that the file name in question is called "tos_O1_2001-2002.nc")
encode.py
decode.py

You can also improve the efficiency by tiling the grids when working with three-dimensional data. Using the example above (24 times × 170 latitudes × 180 longitudes), you could generate a single large PNG image that is 1080 × 680 pixels (six 180-pixel-wide tiles across and four 170-pixel-tall tiles down). As long as you remember the order in which the tiles were created, you can encode and decode three-dimensional data. In case you're curious, using a single large PNG file instead of 24 individual PNG files reduced the file size to 638 KB with essentially the same amount of error.
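A sketch of one possible layout (the function name tile_cube and the row-major tile ordering are my own choices):

import numpy as np

def tile_cube(cube, tiles_x=6):
    # Arrange a (time, lat, lon) cube into one 2-D mosaic, row-major by time
    t, lat, lon = cube.shape                 # e.g. (24, 170, 180)
    tiles_y = t // tiles_x                   # 24 tiles in a 6-across, 4-down grid
    mosaic = np.zeros((lat * tiles_y, lon * tiles_x))
    for k in range(t):
        ty, tx = divmod(k, tiles_x)          # tile row and column for time step k
        mosaic[ty*lat:(ty+1)*lat, tx*lon:(tx+1)*lon] = cube[k]
    return mosaic                            # 680 x 1080 for the SST example

Decoding simply reverses the slicing in the same order.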

If small amounts of imprecision are acceptable for what you're doing, PNG files or their underlying compression techniques are worth considering whenever you're constrained by disc space. I don't recommend using this for numerical model outputs or data that is logarithmic in nature (a small imprecision could translate into a massive miscalculation!). However, for quantities observed in the real atmosphere (temperature, mixing ratio, rainfall, wind speed), the imprecision from this compression technique should be smaller than the error of the weather instrumentation, so it's worth trying on grids of any such variable. And, as an added bonus, PNG files are a lightweight and universally recognized format that require no special software to view, so you can easily embed large grids of data into emails (some email services still have a 25 MB attachment size limit like it's still 2008). All you need to do is make sure the recipient(s) know how to decode the image data.

You can also achieve greater precision by including the alpha (opacity) channel with the RGB triplets. Alpha values also range from 0 to 255, and including this fourth "digit" raises the maximum encoded value to 4095 (using a separation of 32) or 65,535 (using a separation of 16). This does increase the disc space requirement, but the gain in precision will usually outweigh the burden.
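A sketch of a four-digit variant (the name encode_rgba is my own; note that image viewers will render the alpha channel as transparency even though it is just data here):

def encode_rgba(x, factor, offset):
    # Same idea as encode(), but with a fourth base-16 digit in alpha
    z = int(round(factor * (x - offset)))
    z = min(max(z, 0), 0xffff)   # clamp to the 16-bit range
    R = (z >> 12) & 0xf          # 16-cubed digit
    G = (z >> 8) & 0xf           # 16-squared digit
    B = (z >> 4) & 0xf
    A = z & 0xf
    # Shift each digit into the upper 4 bits to keep the separation of 16
    return (R << 4, G << 4, B << 4, A << 4)

Saving then requires a (rows, cols, 4) array and mode="RGBA" in Pillow.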

You could also achieve even more precision by dedicating more than one pixel to each grid point. Instead of 1 pixel per grid point, allocate 2 pixels per grid point, increasing the number of "digits" to 6 (or 8 if you also use the alpha channel). This translates to larger file sizes, but the PNGs should still be smaller than the original datasets.
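For instance, a six-digit sketch that splits one value across two RGB pixels (the name encode_wide is my own):

def encode_wide(x, factor, offset):
    # Six base-16 digits across two pixels: values up to 16**6 - 1 = 16,777,215
    z = int(round(factor * (x - offset)))
    z = min(max(z, 0), 16**6 - 1)
    digits = [(z >> (4 * n)) & 0xf for n in range(5, -1, -1)]
    high = tuple(d << 4 for d in digits[:3])   # most significant digits
    low = tuple(d << 4 for d in digits[3:])    # least significant digits
    return high, low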

Theoretically, this could also be used to compress massive spreadsheets, though you'd have to be creative about handling lengthy strings. If a spreadsheet contains only numerical values and classifications (e.g. someone's favorite color), then it is possible to encode it into a PNG file; all you have to do is set aside one color for each classification that appears in the spreadsheet.
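For illustration, a hypothetical palette for such a column (the triplet choices are arbitrary, but they stay on the multiple-of-16 lattice used by encode above):

# Hypothetical mapping for a "favorite color" column
CATEGORY_COLORS = {
    "red":    (240, 0, 0),
    "green":  (0, 240, 0),
    "blue":   (0, 0, 240),
    "yellow": (240, 240, 0),
}
# Invert the mapping to recover the classifications when decoding
DECODE_COLORS = {rgb: name for name, rgb in CATEGORY_COLORS.items()}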

What motivated the implementation of this idea? As part of my PhD work, I needed to generate hypothetical rainfall grids by running multiple randomly generated trials with different initial conditions. The main issue was the number of combinations I had to evaluate for every single trial. Even though the grids were relatively small (256 × 256), I did the math and found that my dataset would require over 5 TB of disc space (10 trials × 4,212,200 combinations × 131 KB per grid), which I did not have access to. Even after cutting out some intermediate and outlying combinations, the disc space requirement was still far too high.

While debugging the algorithm, I created some PNGs of the gridded data just to visualize what it looked like, and noticed that the PNGs were much smaller than the binary data files I was trying to generate. So I investigated compressing my gridded data into PNG files, and the disc space requirement dropped by over an order of magnitude without losing critical amounts of information. Since datasets aren't getting any smaller, I thought I'd share this concept in case anyone else is dealing with the same issue.

If you're interested in working with this compression technique, you can download an open-source Python package by clicking here.