Defining modeling tasks#
Every hub is organized around a set of “modeling tasks”. These modeling tasks define how the target data should be modeled in terms of
what variables to use for modeling (e.g., date, location, variant) and
the specific format of the model output.
The tasks.json configuration file for a hub is used to structure the modeling tasks so that model submissions can be rapidly validated. (Due to technical issues, we do not currently support JSON references or YAML metadata files.)
Modeling tasks are defined for single or multiple rounds. (For multiple rounds to share the same tasks without duplicating the model_tasks block, round_id_from_variable can be set to true, and the round_id should be a column defined in the task_ids. See the tasks.json schema for details.)
The three components of modeling tasks are:
The task_ids object defines both labels for columns in submission files and the set of valid values for each column. Any unique value combination defines a single modeling task or target.
The output_type object defines accepted representations for each task. The model output section provides more information on the different output types.
The target_metadata array provides additional information about each target.
Task ID variables#
Hubs typically specify that modeling outputs (e.g., forecasts or projections) should be generated for each combination of values across a set of task ID variables. Because they are central to Hubs, task ID variables serve several purposes:
Define modeling tasks of the hub in the hub metadata
Identify modeling tasks corresponding to forecasts in the model outputs
Allow alignment of model outputs with target data that are derived from “ground truth” data sources.
The following diagram illustrates the relationships between these items at a high level, and the following sections provide more detail.

Figure: A modeling hub works as an ecosystem of resources from the hub administrators, modeling teams, and hubverse developers.
Usage of task ID variables#
Task ID variables represent columns in model output files. It’s important to understand that model output files are in tabular format (e.g., csv or parquet). Moreover, these tables are presented in a long/narrow representation where each row of data represents a unique combination of task ID variables and a single value from the model output. (This type of data is also known as “tidy data,” a term coined by Hadley Wickham that’s heavily used in the R community. You can read more about the concept in the Data tidying chapter of the R4DS book and the Tidy Data paper by Wickham (2014).)
In the tasks.json file, task ID variables are a collection of JSON objects that define required and optional values for these variables. In the example below from the COVID-19 variant nowcast hub, there are four task ID variables defined: "nowcast_date", "target_date", "location", and "clade".
"task_ids": {
  "nowcast_date": {
    "required": ["2024-09-11"],
    "optional": null
  },
  "target_date": {
    "required": null,
    "optional": ["2024-08-11", "2024-08-12", "2024-08-13", "2024-08-14", "2024-08-15", "2024-08-16", "2024-08-17", "2024-08-18", "2024-08-19", "2024-08-20", "2024-08-21", "2024-08-22", "2024-08-23", "2024-08-24", "2024-08-25", "2024-08-26", "2024-08-27", "2024-08-28", "2024-08-29", "2024-08-30", "2024-08-31", "2024-09-01", "2024-09-02", "2024-09-03", "2024-09-04", "2024-09-05", "2024-09-06", "2024-09-07", "2024-09-08", "2024-09-09", "2024-09-10", "2024-09-11", "2024-09-12", "2024-09-13", "2024-09-14", "2024-09-15", "2024-09-16", "2024-09-17", "2024-09-18", "2024-09-19", "2024-09-20", "2024-09-21"]
  },
  "location": {
    "required": null,
    "optional": ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY", "PR"]
  },
  "clade": {
    "required": ["24A", "24B", "24C", "recombinant", "other"],
    "optional": null
  }
}
In this particular round, modelers MUST submit predictions with a "nowcast_date" of 2024-09-11 for all of the "clade" values (24A, 24B, 24C, recombinant, and other). Any submission that omits one of these required values, or includes a value that is not listed, will result in an error. In contrast, modelers MAY submit predictions for ANY of the "target_date" values between 2024-08-11 and 2024-09-21 and ANY of the locations listed in "location". Allowing modelers to submit a subset of the optional values means that output from poorly performing models can be omitted so that it does not negatively influence model ensembles. It also allows models that can only cover a subset of the optional values to still participate.
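The required/optional rule described above can be sketched in a few lines of Python. This is an illustration only, not the hubverse validation code: the function name `check_task_id_values` is hypothetical, the `task_ids` dictionary is a shortened copy of the example, and the check looks at one column at a time, whereas real validation works on combinations of values.

```python
# Illustrative sketch only; this is NOT the hubverse validation implementation.
# Shortened copy of the task_ids example (the location list is truncated).
task_ids = {
    "nowcast_date": {"required": ["2024-09-11"], "optional": None},
    "location": {"required": None, "optional": ["AL", "AK", "AZ"]},
    "clade": {
        "required": ["24A", "24B", "24C", "recombinant", "other"],
        "optional": None,
    },
}


def check_task_id_values(column, submitted, task_ids):
    """Return (missing_required, invalid) value sets for one task ID column."""
    spec = task_ids[column]
    required = set(spec["required"] or [])  # null in JSON becomes None
    optional = set(spec["optional"] or [])
    submitted = set(submitted)
    missing_required = required - submitted
    invalid = submitted - required - optional
    return missing_required, invalid


# Any subset of the optional locations is acceptable.
print(check_task_id_values("location", ["AK"], task_ids))
# Every required clade must be present, and unknown values are rejected.
print(check_task_id_values("clade", ["24A", "24B", "XYZ"], task_ids))
```

The first call reports nothing missing and nothing invalid; the second reports the three clades that were left out and flags "XYZ" as a value the hub does not accept.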
Special task ID variables#
Task ID variables are used to parameterize modeling efforts.
However, some task ID variables serve specific purposes in defining submission rounds and targets.
Every hub must have a single task ID variable that uniquely defines a submission round.
By convention, this task ID holds dates in YYYY-MM-DD format (e.g., origin_date or forecast_date).
For example, in Running Example 1, this task ID is origin_date.
There can be one or more task ID variables to define a modeling “target” (these are referred to in the tasks metadata as a target_key).
For example, in our Running Example 1, the task ID variables are target, location, and origin_date.
In this example, target is the target key and can only take on one value, “inc covid hosp”.
Derived Task ID variables#
Each model output task is based on a unique combination of task ID values. For example, take a single origin_date (a task ID that often acts as both the round ID and the starting projection date, i.e., week 0; let’s say “2024-11-07”), 2 location values, and 2 horizon values. Together they define 4 unique tasks (1 origin_date × 2 locations × 2 horizons).
However, it is possible to have task ID variables that are derived directly from others. For example, target_date (which represents when the outcome of interest occurs) can be calculated from the origin_date and horizon (e.g., origin_date + horizon * 7 days for weekly predictions). If the origin_date is "2024-11-07" and the horizon is 1 week, the target_date will be "2024-11-14". For a horizon of 2 weeks, the target_date will be "2024-11-21". Each combination of values of the source task IDs therefore maps to exactly one value of the derived task ID. We strongly advise hub administrators to document which task IDs are derived and how they are calculated.
By adding a target_date task ID to the above example we would still have a total of 4 unique tasks, since target_date is derived from the origin_date and horizon task IDs and each horizon produces a unique valid target_date per location.
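The arithmetic above can be checked with a short Python sketch. The helper name `derive_target_date` and the two location codes are assumptions made for the example; the dates and horizons are the ones used in the text.

```python
from datetime import date, timedelta
from itertools import product

origin_dates = [date(2024, 11, 7)]
locations = ["US", "MA"]  # hypothetical location codes
horizons = [1, 2]  # weeks ahead


def derive_target_date(origin_date, horizon, days_per_horizon=7):
    """Apply target_date = origin_date + horizon * 7 days (weekly horizons)."""
    return origin_date + timedelta(days=horizon * days_per_horizon)


# target_date is computed from the other task IDs, so it adds a column
# to each task without adding any new tasks.
tasks = [
    {
        "origin_date": origin,
        "location": location,
        "horizon": horizon,
        "target_date": derive_target_date(origin, horizon),
    }
    for origin, location, horizon in product(origin_dates, locations, horizons)
]

for task in tasks:
    print(task["location"], task["horizon"], task["target_date"].isoformat())
print(len(tasks), "unique tasks")
```

The list still holds 4 tasks, and each horizon is paired with exactly one target_date.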
While derived task IDs like target_date are helpful for modeling and visualization, they break assumptions made during some validation tests and can also put significant strain on validation performance. As such, they should generally be ignored during standard validation. Custom or optional checks can then be added to the validation workflow to verify their relationship to the task IDs they are derived from (see, for example, the documentation on opt_check_tbl_horizon_timediff(), which checks that the time difference between values in two date columns equals a time period defined by values in a horizon column).
In schema version 4.0.0, we introduced derived_task_ids properties to enable hub administrators to define derived task IDs (i.e., task IDs whose values depend on the values of other task IDs) in their hub config files. The higher-level derived_task_ids property sets the property globally at the hub level but can be overridden by the round-level derived_task_ids property. The property primarily allows validation functionality to ignore such task IDs when appropriate, which can significantly improve validation efficiency. For more information see the hubValidations documentation on ignoring derived task IDs.
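The hub-level and round-level precedence described above could be resolved along the following lines. This is a sketch under assumptions, not hubverse code: the function name `effective_derived_task_ids` is hypothetical, and the config layout (a top-level `derived_task_ids` key plus a `rounds` list whose entries may carry their own `derived_task_ids`) is assumed for illustration; consult the tasks.json schema for the authoritative structure.

```python
# Sketch only: the config layout below is an assumption for illustration.
config = {
    "derived_task_ids": ["target_date"],  # hub-level default
    "rounds": [
        {"round_id": "2024-11-07"},  # no override, inherits hub level
        {"round_id": "2024-11-14", "derived_task_ids": ["target_end_date"]},
    ],
}


def effective_derived_task_ids(config, round_index):
    """Return the round-level setting if present, else the hub-level one."""
    round_config = config["rounds"][round_index]
    if "derived_task_ids" in round_config:
        return round_config["derived_task_ids"]
    return config.get("derived_task_ids") or []


print(effective_derived_task_ids(config, 0))
print(effective_derived_task_ids(config, 1))
```

The first round falls back to the hub-level list, while the second round uses its own override.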
Note
If any task IDs with required values have dependent derived task IDs, it is essential for derived_task_ids to be specified. Otherwise, this will result in false validation errors.
Take for example a scenario where target_date is derived from origin_date and horizon via target_date = origin_date + horizon * 7. If you have a required origin_date value of “2024-11-07”, then your tasks.json must include target_date in derived_task_ids. Without specifying this, modelers will end up with a req_vals check failure even if their submission file is valid.
If a model submission file looks like this:
origin_date | horizon | target_date | … |
---|---|---|---|
2024-11-07 | 1 | 2024-11-14 | … |
2024-11-07 | 2 | 2024-11-21 | … |
When the tasks.json has "derived_task_ids": ["target_date"], the submission will pass the validation checks ✅.
However, without setting derived_task_ids in tasks.json, the submission will result in an ❌ <error/check_failure> whether the target_date content is valid or not. The error will include a table indicating the “missing” required task ID combinations, as shown in the table below.
origin_date | horizon | target_date | validation result |
---|---|---|---|
2024-11-07 | 1 | 2024-11-14 | ✅ |
2024-11-07 | 2 | 2024-11-21 | ✅ |
2024-11-07 | 1 | 2024-11-21 | ❌ |
2024-11-07 | 2 | 2024-11-14 | ❌ |
If you inspect the table, you will notice that the combinations marked ❌ are invalid because the values in target_date are not correctly aligned with the origin_date and horizon (e.g., 2024-11-21 is two weeks ahead of 2024-11-07, not 1 week, as indicated by the horizon).
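To see where the “missing” combinations come from, the following Python sketch expands every combination of the values in the scenario above and compares the result with the submitted rows. It is a simplified illustration, not the hubValidations implementation: the function name `missing_combinations` is hypothetical, and all listed values are assumed to be required.

```python
from itertools import product

# Values from the scenario above, all assumed to be required in this sketch.
task_id_values = {
    "origin_date": ["2024-11-07"],
    "horizon": [1, 2],
    "target_date": ["2024-11-14", "2024-11-21"],
}

# Rows of a valid submission, in the same column order as task_id_values.
submitted_rows = [
    ("2024-11-07", 1, "2024-11-14"),
    ("2024-11-07", 2, "2024-11-21"),
]


def missing_combinations(task_id_values, submitted_rows, derived_task_ids=()):
    """List expected value combinations that are absent from the submission."""
    columns = list(task_id_values)
    # Positions of the columns that are NOT derived and so must be expanded.
    kept = [i for i, name in enumerate(columns) if name not in derived_task_ids]
    expected = set(product(*(task_id_values[columns[i]] for i in kept)))
    seen = {tuple(row[i] for i in kept) for row in submitted_rows}
    return sorted(expected - seen)


# Expanding target_date as if it were independent invents two impossible rows.
print(missing_combinations(task_id_values, submitted_rows))
# Ignoring the derived column leaves nothing missing.
print(missing_combinations(task_id_values, submitted_rows, ["target_date"]))
```

With target_date expanded as an independent column, the two misaligned rows are reported as missing; once it is treated as derived, the same submission is complete.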
Standard task ID variables#
While there are no general restrictions on task ID column names or definitions, using the standard task ID names described below ensures that they are strongly validated against the hubverse schema. We therefore strongly suggest that Hubs adopt the standard task ID or column names and definitions listed below. (As Hubs define new modeling tasks, they may need to introduce new task ID variables that have not been used before. In those cases, the new variables should be added to this list to ensure that the concepts are documented centrally and can be reused in future efforts.)
origin_date: the starting point that can be used for calculating a target_date via the formula target_date = origin_date + horizon * time_units_per_horizon (e.g., with weekly data, target_date is calculated as origin_date + horizon * 7 days). Another reasonable choice for origin_date is reference_date.
forecast_date: usually defines the date a model is run to produce a forecast.
scenario_id: a unique identifier for a scenario
location: a unique identifier for a location
target: a unique identifier for the target. It is recommended, although not required, that hubs set up a single variable to define the target (i.e., as a target key), with additional detail specified in the target_metadata array.
target_date / target_end_date: for short-term forecasts, one of the synonymous task IDs target_date / target_end_date specifies the date of occurrence of the outcome of interest. For instance, if models are requested to forecast the number of hospitalizations on 2022-07-15, the target_date is 2022-07-15.
horizon: The difference between the target_date and the origin_date in time units specified by the hub (e.g., days, weeks, or months)
age_group: a unique identifier for an age group
Output types#
The output_type object defines accepted model output representations for each task. These define what kind of model output is expected; what range of values is expected; when multiple values are expected, what identifies each of them (e.g., a bin, category, or ID); and whether or not the output type is required for submission.
To illustrate how output types are represented in tasks.json, here is an example of a quantile output type:
 1 "quantile": {
 2   "output_type_id": {
 3     "required": [
 4       0.01,
 5       0.5,
 6       0.99
 7     ]
 8   },
 9   "value": {
10     "type": "integer",
11     "minimum": 0
12   },
13   "is_required": true
14 }
From the code block above, you can see that an output type has four components:
(line 1) "quantile": the name of the output type representation (e.g., "cdf", "mean", "median", "quantile", "pmf", "sample")
(line 2) "output_type_id": In the case of quantiles, the output type ID is an indication of the quantile bins. Unlike task IDs, all output_type_ids are required (see note below).
(line 9) "value": the expected value type and range. In this case, the values from this model should be non-negative integers.
(line 13) "is_required": an indication if this output type is required or not. In this example, submissions without this output type would fail.
The formats of model output section from the model output chapter provides more information on the different output types.
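As a rough illustration of how the four components of the quantile example constrain a submission, the Python sketch below checks a handful of (output_type_id, value) pairs against the same settings. The function name `check_quantile_rows` and the row layout are assumptions for the example; this is not how the hubverse tools are implemented, and it only handles the integer value type used here.

```python
# Sketch only: mirrors the quantile example as a Python dictionary.
quantile_spec = {
    "output_type_id": {"required": [0.01, 0.5, 0.99]},
    "value": {"type": "integer", "minimum": 0},
    "is_required": True,
}


def check_quantile_rows(rows, spec):
    """Check (output_type_id, value) pairs for a single modeling task."""
    if not rows:
        return ["required output type is missing"] if spec["is_required"] else []
    problems = []
    required_ids = set(spec["output_type_id"]["required"])
    submitted_ids = {output_type_id for output_type_id, _ in rows}
    if submitted_ids != required_ids:
        problems.append("output_type_id values do not match the required set")
    minimum = spec["value"]["minimum"]
    for output_type_id, value in rows:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_integer = isinstance(value, int) and not isinstance(value, bool)
        if not is_integer:
            problems.append(f"value at {output_type_id} is not an integer")
        elif value < minimum:
            problems.append(f"value at {output_type_id} is below {minimum}")
    return problems


print(check_quantile_rows([(0.01, 3), (0.5, 10), (0.99, 25)], quantile_spec))
print(check_quantile_rows([(0.5, -1)], quantile_spec))
```

The first call reports no problems. The second reports both an incomplete set of output type IDs and a negative value.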
Note
In version 4 of the schemas, we have officially disallowed optional output type IDs. The reason behind this logic is that, unlike task IDs, missing output type IDs have consequences for downstream model scoring and ensembling.
Specifically, these two scenarios are possible if a complete set of quantile bins is not included:
When teams submit different subsets of quantiles and we use a score like WIS to evaluate the model, the scores are different and not comparable when computed on different quantiles. So any end-user would have to take some care to ensure that they are making a comparison on just a subset of required quantiles.
When building ensembles, if you just collected all quantile forecasts without ensuring that you had a complete set of all quantiles from all forecasters, you might combine quantiles from one subset of forecasters for some quantiles and have a different combination of forecasters for other quantiles.
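The ensemble scenario can be made concrete with a small Python sketch. The team names and forecast values are invented for illustration, and the per-quantile mean is just one simple way to combine forecasts, chosen here only to show how the set of contributors can differ from one quantile to the next.

```python
from statistics import mean

# Hypothetical forecasts: quantile level -> predicted value, per team.
forecasts = {
    "team_a": {0.01: 10, 0.5: 50, 0.99: 90},
    "team_b": {0.01: 20, 0.5: 60, 0.99: 120},
    "team_c": {0.5: 100},  # submitted only the median
}


def naive_ensemble(forecasts):
    """Average whatever is available at each quantile level."""
    levels = sorted({level for team in forecasts.values() for level in team})
    ensemble = {}
    for level in levels:
        contributors = sorted(
            name for name, team in forecasts.items() if level in team
        )
        ensemble[level] = {
            "value": mean(forecasts[name][level] for name in contributors),
            "contributors": contributors,
        }
    return ensemble


for level, result in naive_ensemble(forecasts).items():
    print(level, result["value"], result["contributors"])
```

Here the median is averaged over three teams while the tails are averaged over two, so the ensemble's quantiles no longer describe a single consistent group of forecasters.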
Target metadata#
Target metadata is an array in the tasks.json schema file that defines each target’s characteristics.
It serves as a logical connection between task_ids and corresponding output_types:
flowchart LR
  subgraph task-id["task_ids"]
    target
  end
  subgraph output-type["output_type"]
    vars["[output type objects]"]
  end
  subgraph target-metadata["target_metadata"]
    subgraph tk["target_keys"]
      tktarget["target"]
    end
    target-type["target_type"]
  end
  tktarget -->|"matches"| target
  target-type -->|"corresponds to"| vars
Example#
Here is an example of how the target metadata fields might appear in the tasks.json schema for a Hub whose target is incident COVID-19 hospitalizations.
"target_metadata": [
  {
    "target_id": "inc covid hosp",
    "target_name": "Daily incident COVID hospitalizations",
    "target_units": "count",
    "target_keys": {
      "target": "inc covid hosp"
    },
    "description": "Daily newly reported hospitalizations where the patient has COVID, as reported by hospital facilities and aggregated in the HHS Protect data collection system.",
    "target_type": "discrete",
    "is_step_ahead": true,
    "time_unit": "day"
  }
]
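The “matches” link between target_keys and task_ids can be checked with a short Python sketch. The function name `unmatched_target_keys` is hypothetical, and the `task_ids` values (apart from the target itself) are placeholders invented for the example.

```python
# Sketch only: placeholder task_ids paired with the target metadata example.
task_ids = {
    "origin_date": {"required": None, "optional": ["2022-11-28"]},  # placeholder
    "target": {"required": ["inc covid hosp"], "optional": None},
    "location": {"required": None, "optional": ["US"]},  # placeholder
}

target_metadata = [
    {
        "target_id": "inc covid hosp",
        "target_keys": {"target": "inc covid hosp"},
    }
]


def unmatched_target_keys(task_ids, target_metadata):
    """Return (target_id, key, value) for target keys with no task ID match."""
    problems = []
    for entry in target_metadata:
        for key, value in entry["target_keys"].items():
            spec = task_ids.get(key)
            if spec is None:
                problems.append((entry["target_id"], key, value))
                continue
            allowed = (spec["required"] or []) + (spec["optional"] or [])
            if value not in allowed:
                problems.append((entry["target_id"], key, value))
    return problems


print(unmatched_target_keys(task_ids, target_metadata))
```

An empty list means every name/value pair in target_keys points at a task ID and a value that the hub actually defines.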
Details#
Target metadata comprises the following fields:
target_id: a short description uniquely identifying the target.
target_name: a longer, human-readable description of the target, which could be used as a visualization axis label.
target_units: the unit of observation used for this target.
target_keys: a set of one or more name/value pairs that must match a target defined in the task_ids section of the schema. Each value, or the combination of values if multiple keys are specified, defines a single target value.
description: a verbose explanation of the target, which might include details on the measure used for the target or a definition of ‘rate’, for example.
target_type: the target’s statistical data type that must correspond to the output_type section of the schema.
The following table lists the possible values for target_type (rows) and the corresponding output_type (columns). An X indicates that the output type can be used with the target type, and a - means that it can not be used. We note that for the binary data type row, mean and median output_type are X’ed for definitional consistency, but in practice, the hubverse recommends using pmf or sample output_type as a more natural way to represent these values.
target_type | mean | median | quantile | cdf | pmf | sample |
---|---|---|---|---|---|---|
continuous | X | X | X | X | - | X |
discrete | X | X | X | X | X | X |
nominal | - | - | - | - | X | X |
binary | X | X | - | - | X | X |
date | X | X | X | X | X | X |
ordinal | - | X | X | X | X | X |
compositional | X | X | - | - | - | X |
is_step_ahead: a Boolean value that indicates whether the target is part of a sequence of values, defined by time_unit.
time_unit: When is_step_ahead is true, this field should be one of "day", "week", or "month", defining the unit of time steps. This field will be ignored when is_step_ahead is false.
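The target_type and output_type compatibility table above can also be written as a lookup, which is convenient when checking a config programmatically. The sketch below is a direct transcription of that table into Python; the names `COMPATIBLE` and `is_compatible` are hypothetical and not part of any hubverse package.

```python
# Transcription of the target_type / output_type table; names are hypothetical.
OUTPUT_TYPES = ("mean", "median", "quantile", "cdf", "pmf", "sample")

COMPATIBLE = {
    "continuous": {"mean", "median", "quantile", "cdf", "sample"},
    "discrete": set(OUTPUT_TYPES),
    "nominal": {"pmf", "sample"},
    "binary": {"mean", "median", "pmf", "sample"},
    "date": set(OUTPUT_TYPES),
    "ordinal": {"median", "quantile", "cdf", "pmf", "sample"},
    "compositional": {"mean", "median", "sample"},
}


def is_compatible(target_type, output_type):
    """Return True when the table marks this pairing with an X."""
    return output_type in COMPATIBLE[target_type]


print(is_compatible("discrete", "quantile"))
print(is_compatible("nominal", "mean"))
```

For the example target above, whose target_type is "discrete", every output type is allowed, whereas a "nominal" target could only be paired with pmf or sample.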