Configuring tasks#

Every hub is organized around “modeling tasks” that are defined to meet the needs of a project. Modeling tasks are defined for a hub in the tasks.json file, which specifies the three components of modeling tasks (task ID variables, output types, and target metadata). You can read a detailed definition of modeling tasks in the defining modeling tasks section.

Step 1: Open tasks.json#

Make sure you are in the hub-config folder. Then, click on tasks.json to open the file.

Screenshot of how to open tasks.json file in RStudio

Step 2: Examine the tasks.json file#

You should see the code below in your source panel (upper left-hand panel). It has a specific structure that you can explore in the interactive tasks.json schema explorer

Screenshot of the code in the tasks.json file

This tasks.json file serves as a template and has very few values filled out, allowing users to adapt the hub to their needs. Nevertheless, to learn how to use this schema properly, we will use a “premade” tasks.json file from the simple forecast hub example that already has values filled in, which will better illustrate what should go in each section.

Step 3: Close the tasks.json file in RStudio#

Ensure the tasks.json file in RStudio is closed by clicking the ‘x’ icon, as indicated below.

Screenshot of how to close tasks.json file in RStudio

Step 4: Download a premade tasks.json file#

You can use the link to download the tasks.json file from the example forecast hub by clicking on the Download Raw File icon as indicated below.

Screenshot of how to download a tasks.json file from GitHub

Save the file in the hub-config folder (which is in your repository on your local computer). This new file should replace the existing tasks.json file in this folder.

4.1: Examine the new tasks.json file#

Open tasks.json and explore the content and structure. Simple explanations for elements in the example forecast hub file are offered below (click on the hyperlinks for a full explanation of all the supported elements in a tasks.json file and definitions of key concepts):

  • schema_version: Modeling hub schema versions are all housed in the schemas repository.

  • round_id: The round identifier establishes which date from a forecast submission is used to identify the submission round it corresponds to (e.g., the origin date).

  • model_tasks: Model tasks include all the goals of the modeling effort, including the task_ids, output_type, and target_metadata.

  • task_ids: The task identifiers set the optional and required elements in a forecast submission, such as the target, horizon, location, and origin date.

  • origin_date: The date when a forecast was generated. More information on this and other dates, including using the origin_date to calculate the target_date, can be found in the section on using task ID variables.

  • horizon: Sets the time range for which forecast predictions are to be made. For instance, these can be days into the future or even days into the past, as in nowcasts.

  • location: The geographic identifier, such as country codes or FIPS state/county level codes.

  • output_type: A model output type establishes the valid model output types, such as the mean or specific quantiles. A more detailed explanation of model outputs can be found in the section on model output formats.

Now, read below for details on some of the lines of code in this file:

Step 5: Define "task_ids"#

5.1. Establishing the "round_id" and "origin_date" (starting point):#

  • The code highlighted in green establishes that the round identifier is encoded by a task id variable in the data.

  • The code highlighted in light blue sets the round identifier as "origin_date".

  • task_ids includes the variables origin_date, target, horizon, and location.

  • The first variable, origin_date is highlighted in yellow and states that no origin dates are required and that there are three valid, possible dates ("2022-11-28", "2022-12-05", "2022-12-12"). To be clear, no specific origin_date is required because every submission will have a different origin_date as each submission corresponds to a different forecasting period (compare this with location, where some specific locations may be required for every submission.

Some of the initial lines of code in the tasks.json file

5.2. Setting the "target":#

  • The second line states that "inc covid hosp" for example is the required target. Additional required targets could be added here.

  • The third line states that there are no other optional targets that are valid. You could add ["cum covid hosp"], for example, if you wanted to allow that target but not require it.

Some lines of code in the tasks.json file

5.3. Setting up the "horizon":#

  • The horizon refers to the difference between the target_date and the origin_date in time units specified by the hub (these could be days, weeks, or months).

  • The second line indicates that no horizons are required.

  • The third line states that the forecast can be for up to 6 days before the origin_date, and up to 14 days after the origin_date.

More lines of code in the tasks.json file

5.4. Setting up "location":#

  • The location refers to geographic identifiers such as country codes or FIPS state/county level codes.

  • The second line states that no particular location is required, although, in some instances, specific locations might be required for all submissions.

  • The third line indicates the locations that may be submitted. The locations in this example correspond to FIPS codes for US states and territories.

Even more lines of code in the tasks.json file

5.5. required and optional elements:#

As seen previously, each task_ids has a required and an optional property to indicate expected and possible additional information, respectively.

  • To indicate no possible additional information, optional can be set to null.

  • If required is set to null but optional contains values, (see for example "location"): no particular value is required, but at least one of the optional values is expected.

  • There may be cases where we have multiple model_tasks and a given task ID is relevant to one or more model tasks but not others. For example, in the code snippet below; on lines 8–14, the horizon task id is relevant to the first model task, whose target is inc covid hosp, and any one of the optional values specified is expected in the horizon column in a model output file. However, shown on lines 30–36, horizon is irrelevant to the second model task, whose target is peak size. For this model task, both required and optional are set to null in the horizon task ID configuration, and a missing value is expected in the horizon column in model output files.

 1"model_tasks": [
 2    {
 3        "task_ids": {
 4            "origin_date": {
 5                "required": null,
 6                "optional": ["2022-11-28"]
 7            },
 8            "target": {
 9                "required": ["inc covid hosp"],
10                "optional": null
11            },
12            "horizon": {
13                "required": null,
14                "optional": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
15            },
16            "location": {
17                "required": ["US"],
18                "optional": null
19            }
20        },
21        "output_type": {...},
22        "target_metadata": [...]
23    },
24    {
25        "task_ids": {
26            "origin_date": {
27                "required": null,
28                "optional": ["2022-11-28"]
29            },
30            "target": {
31                "required": ["peak size hosp"],
32                "optional": null
33            },
34            "horizon": {
35                "required": null,
36                "optional": null
37            },
38            "location": {
39                "required": ["US"],
40                "optional": null
41            }
42        },
43        "output_type": {...},
44        "target_metadata": [...]
45    }
46]

Step 6: Define "output_type":#

The output_type specifies the types of outputs accepted for a given model task. This example includes mean and quantile, but median, cdf, pmf, and sample are other supported output types.

Each output type contains the following properties:

  • output_type_id (or output_type_id_params in the case of sample): specifies/configures valid output type ID values.

  • value: specifies the format and rules for the output’s values, such as the data type and any limits (e.g., minimum or maximum values).

  • is_required: A true/false property to indicate whether an output type is required.

6.1. Setting the "mean":#

  • Here, the "mean" of the predictive distribution is set as a valid output_type value for a submission file.

  • 1In a CSV file, this is represented as either a blank cell (default for Python) or NA (default for R).
  • "output_type_id" must be null for all point estimates (mean, median, etc). No output type IDs for these estimates are applicable, so in a model submission file, they are represented as missing data1In a CSV file, this is represented as either a blank cell (default for Python) or NA (default for R)..

  • "value" sets the characteristics of this valid output_type (i.e., the mean). In this instance, the value must be an integer greater than or equal to 0.

  • "is_required" defines if a mean prediction is required (true) or if it is optional (false). The code below shows that the mean output type is optional.

 1"mean": {
 2    "output_type_id": {
 3        "required": null
 4    },
 5    "value": {
 6        "type": "integer",
 7        "minimum": 0
 8    },
 9    "is_required": false
10}

6.2. Setting up "quantile":#

  • Here, the quantile object configures what values are accepted in the output_type_id and value columns when the value in the output_type column in a submission file is "quantile".

  • 2During validation, the quantile output type IDs are compared as character strings instead of as numeric (floating point) values. There is a good reason for this: floating point numbers have precision problems.
  • "output_type_id" (line 2) establishes the accepted probability levels at which quantiles of the predictive distribution will be recorded. In this case, quantiles are required at discrete levels that range from 0.01 to 0.99. Quantile output_type_id values must NOT contain trailing zeros as this will cause submission validation checks to fail2During validation, the quantile output type IDs are compared as character strings instead of as numeric (floating point) values. There is a good reason for this: floating point numbers have precision problems..

  • As before, "value" (line 29) sets the characteristics of valid quantile value values. In this instance, the values must be integers greater than or equal to 0.

  • "is_required" (line 33) defines whether a quantile prediction is required (true) or if it is optional (false). The code below shows that the quantile output type is required.

 1"quantile": {
 2    "output_type_id": {
 3        "required": [
 4            0.01,
 5            0.025,
 6            0.05,
 7            0.1,
 8            0.15,
 9            0.2,
10            0.25,
11            0.3,
12            0.35,
13            0.4,
14            0.45,
15            0.5,
16            0.55,
17            0.6,
18            0.65,
19            0.7,
20            0.75,
21            0.8,
22            0.85,
23            0.9,
24            0.95,
25            0.975,
26            0.99,
27        ]
28    },
29    "value": {
30        "type": "integer",
31        "minimum": 0
32    },
33    "is_required": true
34}

Step 7: Defining "target_metadata":#

  • "target_metadata" defines the characteristics of each unique target.

  • To begin with, "target_id" is a short description that uniquely identifies the target.

  • Similarly, "target_name" provides a longer, human readable description of the target.

  • "target_units" indicates the unit of observation used for this target. In this instance, the unit is "count".

  • "target_keys" expect either a null value or an object containing a single key/value pair that appropriately identifies the target. If providing a null value, the hub is assumed to have a single target that is not encoded in any task ID variables. The name of the target is instead provided through the target_id property. If a key/value pair is provided, the key must match the name of a task ID variable and the single value must match a valid task ID value of the task ID variable specified in the key. In this instance, the key is the task ID variable name "target" and the value is "inc covid hosp".

  • The "description" is a verbose explanation of the target, which might include details on the measure used for the target, as shown in the example below.

  • The "target_type" defines the target’s statistical data type. In this instance, the target uses discrete data.

  • "is_step_ahead" indicates whether the target is part of a sequence of values. In this instance, it is.

  • "time_unit" defines the units of the time steps. In this case, it is days.

 1"target_metadata": [
 2    {
 3        "target_id": "inc covid hosp",
 4        "target_name": "Daily incident COVID hospitalizations",
 5        "target_units": "count",
 6        "target_keys": {
 7            "target": "inc covid hosp"
 8        },
 9        "description": "Daily newly reported hospitalizations where the patient has COVID, as reported by hospital facilities and aggregated in the HHS Protect data collection system.",
10        "target_type": "discrete",
11        "is_step_ahead": true,
12        "time_unit": "day"
13    }
14]

Step 8: Set up "submissions_due":#

  • "submissions_due" establishes the dates by which model forecasts must be submitted to the hub. It is used by hubValidations when validating submission files.

The schema permits two ways to set the dates during which model forecasts can be submitted:

  1. By setting a "relative_to" date, as well as "start" and "end" integers, which set the range of dates in which submissions are accepted, as explained in the example below.

  • "relative_to" specifies the task id variable with which submission start and end dates are calculated. In this instance, it is "origin_date".

  • "start" is a number used to calculate the beginning of the submission period, based on the origin_date. In this example, the start date is six days prior to origin_date.

  • On the other hand, "end" is a number used to calculate when the submission period is finished, based on the origin_date. In this example, the end date is one day after origin_date.

  • For instance, as was mentioned before, in this file, 2022-11-28 is allowed as an origin_date. In this case, submissions are due between “2022-11-22” (six days prior) and “2022-11-29” (one day after).

Last lines of code in the tasks.json file
  1. By setting explicit "start" and "end" dates (rather than integers as in the previous case), during which forecast submissions are accepted. In this case, the "relative_to" line should be omitted altogether, as in the example below:

"submissions_due": {
    "start": "2022-06-07",
    "end": "2022-07-20"
}

Step 9: Optional - set up "output_type_id_datatype":#

Once all modeling tasks and rounds have been configured, you may also choose to fix the output_type_id column data type across all model output files of the hub using the optional "output_type_id_datatype" property.

This property should be provided at the top level of tasks.json (i.e., sibling to rounds and schema_version) and can take any of the following values: "auto", "character", "double", "integer", "logical", "Date".

{
  "schema_version": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v*/tasks-schema.json",
  "rounds": [...],
  "output_type_id_datatype": "character"
}

For more context and details on when and how to use this setting, please see the output_type_id column data type section on the model output page.