Working with WorkData
WorkData is managed by the cloud platform to ensure the data is safe and passed to and from ProcessingSteps correctly. If any data is needed by a ProcessingStep and is not already in PineXQ, you need to upload it.
Uploading WorkData
To upload a file, follow these steps:
- Create a WorkData object using a configured client.
- Execute create() to upload your data.
- (optional) Edit metadata, such as tags, or allow later deletion.
```python
from pinexq.client.job_management import WorkData
from pinexq.client.core import MediaTypes

with open("TestData/test.npy", "rb") as testfile:
    workdata = (WorkData(client=client)
                .create(filename="test.npy",
                        file=testfile,
                        mediatype=MediaTypes.OCTET_STREAM)
                .set_tags(["TestData"])
                .allow_deletion())
```

Note that in addition to files, WorkData.create() can also accept JSON-serializable objects (json=obj) and binary streams (binary=stream), though supplying more than one input source will raise an error.
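A minimal sketch of those two variants, assuming json= and binary= combine with the same filename= and tagging calls as the file upload above (the exact parameter combinations are not documented on this page):

```python
import io

from pinexq.client.job_management import WorkData
from pinexq.client.core import MediaTypes

# Sketch: upload a JSON-serializable object via json= (mentioned above).
params = {"iterations": 100, "threshold": 0.5}
json_workdata = (WorkData(client=client)
                 .create(filename="params.json", json=params)
                 .set_tags(["Parameters"]))

# Sketch: upload an in-memory binary stream via binary=.
stream = io.BytesIO(b"\x00\x01\x02\x03")
binary_workdata = (WorkData(client=client)
                   .create(filename="blob.bin",
                           binary=stream,
                           mediatype=MediaTypes.OCTET_STREAM))
```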
Alternatively, it is possible to upload WorkData via the portal. You can then use the WorkData URL to configure a step.
As a reminder, media types must match those expected by the processing step to ensure compatibility.
The default media type MediaTypes.OCTET_STREAM = "application/octet-stream" is intended for binary files. For a list of common media types, see: MDN - Common media types
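As an illustration, uploading a CSV file with a matching media type might look like the sketch below; whether mediatype= also accepts a plain string (rather than only MediaTypes members) is an assumption here:

```python
from pinexq.client.job_management import WorkData

# Sketch: match the media type to the file content.
# Passing a raw string for mediatype= is an assumption; prefer a
# matching MediaTypes member if one exists.
with open("TestData/data.csv", "rb") as csvfile:
    csv_workdata = (WorkData(client=client)
                    .create(filename="data.csv",
                            file=csvfile,
                            mediatype="text/csv"))
```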
Using input WorkData
Once we have uploaded WorkData, we can assign it to a job.
```python
# workdata created previously
# partially-configured job
# index denotes the n-th dataslot as specified by the configured ProcessingStep
job.assign_input_dataslot(index=0, work_data_instance=workdata)

# alternative using urls
job.assign_input_dataslot(index=0, work_data_url="<url>")
```

DataSlot collections, which pass multiple WorkData to a single slot, are handled in a similar way:
```python
# workdata created previously
# partially-configured job
# index denotes the n-th dataslot as specified by the configured ProcessingStep
job.assign_collection_input_dataslot(index=0, work_data_instances=[workdata])

# alternative using urls
job.assign_collection_input_dataslot(index=0, work_data_urls=["<url>"])
```

It is also possible to provide input data when configuring a job with the single-call option:
```python
from pinexq.client.job_management import InputDataSlotParameterFlexible, Job, ProcessingStep

input_data_slot = InputDataSlotParameterFlexible(index=0, work_data_instances=[workdata])

processing_step = ProcessingStep.from_name(
    client=client,
    function_name="my_function",
    function_version="1.0.0"
)

job = Job(client).create_and_configure_rapidly(
    name="my_job",
    processing_step_instance=processing_step,  # or processing_step_url="..."
    tags=["test", "my_function"],
    input_data_slots=[input_data_slot],
    start=True
)

result = job.get_result()
job.delete()
```

Downloading output WorkData
As a job executes, output files are created automatically; when configuring a job, you may choose to make these files (more easily) deletable. Result WorkData is handled differently: since it is not known in advance which files (or, in the case of collection DataSlots, how many files) will be written, the files are pre-allocated when the job is scheduled. For normal collection DataSlots, the maximum number of files specified by the ProcessingStep is allocated. The files are given a generic name, which can be changed later, as sketched below.
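The rename itself might look like the following sketch; set_name() is a hypothetical method name used purely for illustration, as this page does not document the actual call:

```python
# Hypothetical: give a pre-allocated result file a meaningful name.
# set_name() is an assumed method; consult the WorkData API reference
# for the actual rename operation.
result_workdata.set_name("my_function_result.npy")
```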
After a job is complete, we can look up its output DataSlots and obtain a list of WorkData. Each WorkData carries metadata, such as the name of the corresponding output DataSlot.
In the following snippet, we wait for a job to complete, then download the output DataSlot “my_output” and the result DataSlot to respective local files:
```python
from pinexq.client.job_management import WorkData

# previously started job
job.wait_for_completion(timeout_s=600.0)

output_data_slots = job.get_output_data_slots()
output_workdata = {slot.title: WorkData.from_hco(slot.assigned_workdatas[0])
                   for slot in output_data_slots}

with open("output.npy", "wb") as file:
    file.write(output_workdata["my_output"].download())

# the result dataslot is an output dataslot with this name
with open("result.npy", "wb") as file:
    file.write(output_workdata["__returns__"].download())
```

As usual, WorkData will be placed in your WorkData tab on the portal, and can be searched and downloaded from there as well.
Uploading large files
HTTP requests for large files might time out on the client side; this is a known limitation at the moment. To avoid it, configure the client with a generous upload timeout so it can wait for the server's response.
The underlying HTTP client, the httpx library, has a default timeout of 5 seconds, which is often not sufficient for large file uploads over a network. The overall timeout of the httpx client can be adjusted like this:
```python
client.timeout = 60.0  # in seconds; default is 5 s
```
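For finer control, httpx itself distinguishes connect, read, write, and pool timeouts. Whether client.timeout also accepts an httpx.Timeout object is an assumption in the sketch below; if it does not, apply the configuration to the underlying httpx client instead:

```python
import httpx

# Sketch: per-phase timeouts (assumes client.timeout accepts an
# httpx.Timeout object, as httpx clients do). The write timeout is
# the relevant one for large uploads; connect can stay strict.
client.timeout = httpx.Timeout(connect=5.0, read=60.0, write=600.0, pool=5.0)
```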