Model output#
Directory structure#
The model-output directory1 in a modeling hub is required to have the following subdirectory and file structure:

model_id1/
  <round_id1>-<model_id1>.csv (or parquet, etc.)
  <round_id2>-<model_id1>.csv (or parquet, etc.)
model_id2/
  <round_id1>-<model_id2>.csv (or parquet, etc.)
model_id3/
  <round_id1>-<model_id3>.csv (or parquet, etc.)

where model_id = team_abbr-model_abbr.

1 The directory is required, but the name is flexible. You can use a custom directory path by setting the "model_output_dir" property in the admin.json file. More details can be found in the admin.json schema definition.
Expected patterns#
The elements making up model output directory and file names must match the following patterns:

- round_ids must be either ISO formatted dates (YYYY-MM-DD) or any combination of alphanumerics separated by underscores (_).
- model_ids are composed of team_abbr and model_abbr separated by a hyphen (i.e., team_abbr-model_abbr).
- team_abbr and model_abbr may contain only combinations of alphanumerics separated by underscores (_).

Note that file names are also allowed to contain the following compression extension prefixes: .snappy, .gzip, .gz, .brotli, .zstd, .lz4, .lzo, .bz2, e.g. <round_id1>-<model_id1>.gz.parquet.
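For a quick sanity check of a file name against these patterns, a regular expression can help. This is a minimal sketch, not official hubverse tooling; the regex and example file name are illustrative only:

# Illustrative check that a file name follows <round_id>-<model_id> with an
# optional compression prefix and a csv or parquet extension.
file_name <- "2022-10-12-hub-baseline.gz.parquet"
pattern <- paste0(
  "^([0-9]{4}-[0-9]{2}-[0-9]{2}|[A-Za-z0-9_]+)-",    # round_id: ISO date or alphanumerics/underscores
  "[A-Za-z0-9_]+-[A-Za-z0-9_]+",                     # model_id: team_abbr-model_abbr
  "(\\.(snappy|gzip|gz|brotli|zstd|lz4|lzo|bz2))?",  # optional compression prefix
  "\\.(csv|parquet)$"                                # file format extension
)
grepl(pattern, file_name)  # TRUE if the name matches the convention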
Example model output directory structure#
With ISO date round_ids#

hub-baseline/
  2022-10-12-hub-baseline.csv
  2022-10-19-hub-baseline.csv
team_1-ensemble/
  2022-10-12-team_1-ensemble.parquet
  2022-10-19-team_1-ensemble.gz.parquet

With alphanumeric round_ids#

hub-baseline/
  2024_2025_1_covid-hub-baseline.csv
  2024_2025_1_flu-hub-baseline.csv
team_1-ensemble/
  2024_2025_1_covid-team_1-ensemble.parquet
  2024_2025_1_flu-team_1-ensemble.gz.parquet
Example model submission file#
Each model submission file will have the same representation for each hub. Here is an example of a hub that collects mean and quantile forecasts for one-week-ahead incidence, but probabilities for the timing of a season peak:
| origin_epiweek | target | horizon | output_type | output_type_id | value |
|---|---|---|---|---|---|
| EW202242 | weekly rate | 1 | mean | NA | 5 |
| EW202242 | weekly rate | 1 | quantile | 0.25 | 2 |
| EW202242 | weekly rate | 1 | quantile | 0.5 | 3 |
| EW202242 | weekly rate | 1 | quantile | 0.75 | 10 |
| EW202242 | weekly rate | 1 | pmf | 0 | 0.1 |
| EW202242 | weekly rate | 1 | pmf | 0.1 | 0.2 |
| EW202242 | weekly rate | 1 | pmf | 0.2 | 0.7 |
| EW202242 | peak week | NA | pmf | EW202240 | 0.001 |
| EW202242 | peak week | NA | pmf | EW202241 | 0.002 |
| EW202242 | … | … | … | … | … |
| EW202242 | peak week | NA | pmf | EW202320 | 0.013 |
| EW202242 | weekly rate | 1 | sample | 1 | 3 |
| EW202242 | weekly rate | 1 | sample | 2 | 3 |
File formats#
Hubs can take submissions in tabular data formats, namely csv and parquet. These submission formats are not mutually exclusive; hubs may choose between parquet (Arrow), csv, or both. Both formats have advantages and tradeoffs:

Considerations about csv:

- Advantages:
  - Compatibility: files are human-readable and are widely supported by many tools.
- Disadvantages:
  - Size: some projects have run into 100 MB file size limits when using csv formatted files.

Considerations about parquet:

- Advantages:
  - Speed.
  - Size: in combination, splitting files up and using parquet would get around GitHub limits on file sizes.
  - Loads only data that are needed.
- Disadvantages:
  - Compatibility: harder to work with; teams and people who want to work with files need to install additional libraries.
Examples of how to create these file formats in R and Python are listed below in the writing model output section.
Formats of model output#
Reference: Much of the material in this section has been excerpted or adapted from the hubEnsembles manuscript.3

3 Shandross, L., Howerton, E., Contamin, L., Hochheiser, H., Krystalli, A., Consortium of Infectious Disease Modeling Hubs, Reich, N. G., Ray, E. L. (2024). hubEnsembles: Ensembling Methods in R. (under review for publication) (Repo: https://github.com/hubverse-org/hubEnsemblesManuscript).
Model outputs are a specially formatted tabular representation of predictions.
Each row corresponds to a unique prediction, and each column provides information about what is being predicted, its scope, and its value.
Per hubverse convention, there are two groups of columns providing metadata about the prediction,4 followed by a value column with the actual output. Each group of columns serves a specific purpose: (1) the “task ID” columns provide details about what is being predicted, and (2) the two “model output representation” columns specify the type of prediction and identifying information about that prediction. Finally, (3) the value column provides the model output of the prediction. Details about the column specifications can be found below.

4 When using models for downstream analysis with the collect_hub() function in the hubData package, one more column called model_id is prepended that identifies the model from its filename.
Details about model output column specifications#
As shown in the model output submission table above, there are three “task ID” columns: origin_epiweek, target, and horizon; and there are two “model output representation” columns, output_type and output_type_id, followed by the value column.
More detail about each of these column groups is given in the following points:
- “Task IDs” (multiple columns): The details of the outcome (the model task) are provided by the modeler and can be stored in a series of “task ID” columns as described in this section on task ID variables. These “task ID” columns may also include additional information, such as any conditions or assumptions used to generate the predictions. Some example task ID variables include target, location, reference_date, and horizon. Although there are no restrictions on naming task ID variables, we suggest that hubs adopt the standard task ID or column names and definitions specified in the section on usage of task ID variables when appropriate.
- “Model output representation” (2 columns): consists of two columns specifying how the model outputs are represented. Both of these columns will be present in all model output data:
  - output_type specifies the type of representation of the predictive distribution, namely "mean", "median", "quantile", "cdf", "pmf", or "sample".
  - output_type_id specifies more identifying information specific to the output type, which varies depending on the output_type.
- value contains the model’s prediction.
The following table provides more detail on how to configure the three “model output representation” columns based on each model output type.
| output_type | output_type_id | value |
|---|---|---|
| mean | NA (not used) | Numeric: the mean of the predictive distribution |
| median | NA (not used) | Numeric: the median of the predictive distribution |
| quantile | Numeric between 0.0 and 1.0: a probability level | Numeric: the quantile of the predictive distribution at the probability level specified by the output_type_id |
| cdf | String or numeric: a possible value of the target variable | Numeric between 0.0 and 1.0: the value of the cumulative distribution function of the predictive distribution at the value of the outcome variable specified by the output_type_id |
| pmf | String naming a possible category of a discrete outcome variable | Numeric between 0.0 and 1.0: the value of the probability mass function of the predictive distribution when evaluated at a specified level of a categorical outcome variable |
| sample | Positive integer sample index | Numeric: a sample from the predictive distribution |
Note

The model output type IDs have different caveats depending on the output_type:

- mean and median: Point estimates do not have an output_type_id because you can only have one point estimate for each combination of task IDs. However, because the output_type_id column is required, it must still be filled with a missing value, encoded as NA in R and None in Python. See the example on writing parquet files for details.
- pmf: Values are required to sum to 1 across all output_type_id values within each combination of values of task ID variables. This representation should only be used if the outcome variable is truly discrete; a CDF representation is preferred if the categories represent a binned discretization of an underlying continuous variable.
- sample: Depending on the hub specification, samples with the same sample index (specified by the output_type_id) may be assumed to correspond to a single sample from a joint distribution across multiple levels of the task ID variables; further details are discussed below.
- cdf (and pmf for ordinal variables): In the hub’s tasks.json configuration file, the values of the output_type_id should be listed in order from low to high.
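For illustration, here is a minimal sketch in R (with made-up values borrowed from the example table above) of a model output data frame that combines a point estimate with quantile rows; note the NA placeholder in output_type_id for the mean row:

# A minimal sketch with made-up values: one mean row (output_type_id is NA)
# and three quantile rows for a single modeling task.
model_out <- data.frame(
  origin_epiweek = "EW202242",
  target         = "weekly rate",
  horizon        = 1L,
  output_type    = c("mean", "quantile", "quantile", "quantile"),
  output_type_id = c(NA, 0.25, 0.5, 0.75),  # NA marks the unused ID for the point estimate
  value          = c(5, 2, 3, 10)
)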
Writing model output to a hub#
The model output follows the specification of the tasks.json
configuration
file of the hub. If you are creating a model and would like to know what
data type your columns should be in, the Hubverse has utilities to provide an
arrow schema and even a full submission
template from the tasks.json
configuration file.
When submitting model output to a hub, it should be placed in a folder with the
name of your model_id
in the model outputs folder specified by the hub
administrator (this is usually called model-output
). Below are R and Python
examples for writing Hubverse-compliant model output files in both CSV and
parquet format. In these examples, we are assuming the following variables
already exist:
- hub_path is the path to the hub cloned on your local computer
- model_id is the combination of <team_abbr>-<model_abbr>
- file_name is the file name of your model formatted as <round_id>-<model_id>.csv (or .parquet)
- model_out is the tabular output from your model formatted as specified in the formats of model output section.
Submission Template#
The hubverse package hubValidations
has functionality
that will generate template data to get you started. This submission
template can be written as a CSV or parquet file and then imported into
whatever software you use to run your model.
Here is some example code that can help. In this example, hub_path
is the
path to the hub on your local machine.
# read the configuration file and get the latest round
config_tasks <- hubUtils::read_config(hub_path)
rounds <- hubUtils::get_round_ids(config_tasks)
this_round <- rounds[length(rounds)]
# create the submission template (this may take some time if your submission uses samples)
tmpl <- hubValidations::submission_tmpl(config_tasks = config_tasks, round_id = this_round)
You can then either write this template to a csv file with the readr
package:
# write the template to a csv file to use in your model code.
readr::write_csv(tmpl, "/path/to/template.csv")
OR you can write it to a parquet file with the arrow
package:
# write the template to a parquet file to use in your model code.
arrow::write_parquet(tmpl, "/path/to/template.parquet")
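Once the template exists, your modeling code can read it back in, fill in the value column, and write the result out as your submission file. This is a minimal sketch assuming a CSV template at the illustrative path used above:

# Read the template, fill in predictions, and write the submission file.
tmpl <- readr::read_csv("/path/to/template.csv")
# ... replace tmpl$value with your model's predictions ...
readr::write_csv(tmpl, fs::path(hub_path, "model-output", model_id, file_name))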
Example: model output as CSV#
The sections below provide examples for writing CSV model output files. Note that missing data in a CSV file should be either a blank cell (that is, two adjacent commas ,,) or NA without quotes5 (e.g. ,NA,).

5 You can quote me on this: No quotes.
Writing CSV with R#
When writing a model output file in R, use the readr
package.
# ... generate model data ...
outfile <- fs::path(hub_path, "model-output", model_id, file_name)
readr::write_csv(model_out, outfile)
Writing CSV with Python#
This example uses the pandas
package when creating CSV model output files.
import pandas as pd
import os.path
# ... generate model data ...
outfile = os.path.join(hub_path, "model-output", model_id, file_name)
model_out.to_csv(outfile, index = False)
Example: model output as parquet#
Unlike a CSV file, a parquet file contains embedded information about the data types of its columns. Therefore, when writing model output files as parquet, it’s critical that you first ensure the data type of your columns matches the expected type from the Arrow schema.
If the data types of the model output parquet file don’t match the hub’s schema, the submission will not validate. In practice, you will need to know whether the expected data type is a string/character, float/numeric, or int/integer.
Arrow Schema#
The hubverse packages hubData
and hubUtils
have functionality that will generate an
arrow schema so that you can ensure your output matches the expected type.
Here is some example code that can help. In this example, hub_path
is the
path to the hub on your local machine.
# read the configuration file and get the latest round
config_tasks <- hubUtils::read_config(hub_path, "tasks")
schema <- hubData::create_hub_schema(config_tasks)
The schema output will look something like this:
Schema
origin_date: date32[day]
target: string
horizon: int32
location: string
age_group: string
output_type: string
output_type_id: double
value: int32
model_id: string
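To compare your own data against this schema before writing a file, one option (a minimal sketch, not required by the hubverse) is to convert your data frame to an Arrow table and inspect its schema:

# Inspect the Arrow data types of your model output data frame and compare
# them with the hub schema printed above.
arrow_tbl <- arrow::arrow_table(model_out)
arrow_tbl$schema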
Writing parquet with R#
You can use the hubData::coerce_to_hub_schema() function to ensure your data are in the correct format before writing them out.
# ... generate model data ...
outfile <- fs::path(hub_path, "model-output", model_id, file_name)
# coerce model output data to the data types of the hub schema
config_tasks <- hubUtils::read_config(hub_path, "tasks")
model_out <- hubData::coerce_to_hub_schema(model_out, config_tasks)
# write to parquet file
arrow::write_parquet(model_out, outfile)
Writing parquet with Python#
This example uses the pandas package to create parquet files. Importantly, if you are creating a parquet file, you will need to ensure your column types match the hub schema. You can do this by using the astype() method for pandas DataFrames.6

6 If you prefer to use polars for your model output, you would use the polars cast() method.
import pandas as pd
import os.path
# ... generate model data ...
outfile = os.path.join(hub_path, "model-output", model_id, file_name)
# update the output_type_id data type to match the hub's schema
model_out["output_type_id"] = model_out["output_type_id"].astype("float") # or "string", or "Int64"
model_out.to_parquet(outfile)
Model output relationships to task ID variables#
We emphasize that the mean, median, quantile, cdf, and pmf representations all summarize the marginal predictive distribution for a single combination of model task ID variables.
In contrast, we cannot assume the same for the sample representation.
By recording samples from a joint predictive distribution, the sample representation may capture dependence across combinations of multiple model task ID variables.
For example, suppose the model task ID variables are “forecast date”, “location”, and “horizon”. A predictive mean will summarize the predictive distribution for a single combination of forecast date, location, and horizon. On the other hand, there are several options for the distribution from which a sample might be drawn, capturing dependence across different levels of the task ID variables, including:
the joint predictive distribution across all locations and horizons within each forecast date
the joint predictive distribution across all horizons within each forecast date and location
the joint predictive distribution across all locations within each forecast date and horizon
the marginal predictive distribution for each combination of forecast date, location, and horizon
Hubs should specify the collection of task ID variables for which samples are expected to capture dependence; e.g., the first option listed above might specify that samples should be drawn from distributions that are “joint across” locations and horizons.
More details about the sample output type can be found in the page describing sample output type data.
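As an illustration of samples that are “joint across” horizons (the second option above), here is a minimal sketch with made-up task ID names and values; each output_type_id value labels one joint draw spanning both horizons for a given forecast date and location:

# A minimal sketch with made-up values: sample index 1 is one joint draw
# across horizons 1 and 2; sample index 2 is another joint draw.
samples <- data.frame(
  forecast_date  = "2022-10-15",
  location       = "US",
  horizon        = c(1L, 2L, 1L, 2L),
  output_type    = "sample",
  output_type_id = c(1L, 1L, 2L, 2L),
  value          = c(120, 131, 98, 104)
)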
Omitted output types#
Some other possible model output representations have been proposed but not included in the list above. We document these other proposals and the reasons for their omissions here:
Point estimates#
- Bin probability. Two notes:
  - If the bins have open left endpoints and closed right endpoints, bin probabilities can be calculated directly from CDF values.
  - We considered a system with a more flexible specification of bin endpoint inclusion status but noted two disadvantages:
    - This additional flexibility would introduce substantial extra complexity to the metadata specifications and tooling.
    - Adopting limited standards that encourage hubs to adopt settings consistent with the common definitions might be beneficial. For example, if a hub adopted bins with open right endpoints, the resulting probabilities would be incompatible with the conventions around cumulative distribution functions.
- Probability of “success” in a setting with a binary outcome.
  - This can be captured with a CDF representation if the outcome variable is ordered or a categorical representation if the outcome variable is not.
- Compositional. For example, we might request a probabilistic estimate of the proportion of hospitalizations next week due to influenza A/H1, A/H3, and B.
  - Note that a categorical output representation could be used if only point estimates for the composition were required.
Validating prediction values#
Before model outputs can be incorporated into a hub, they must be validated. If a hub is centrally stored on GitHub, validation checks will be automatically performed for each submission (via the validate_pr()
function from the hubValidations
R package).
Teams can also validate their submissions locally via the validate_submission() function from the hubValidations R package, which performs two validation tasks:

1. Validation based on rules that can easily be encoded in the JSON schema, such as ranges of expected values and output_type_ids.
2. Validation of more involved rules that cannot be encoded in a JSON schema and are therefore implemented separately (such as specific relationships between outputs and targets). You can find a table with the details of each check in the validate_submission() documentation and the validate_pr() documentation.
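For example, a team could run something like the following from R before opening a pull request. This is a minimal sketch; it reuses the hub_path, model_id, and file_name variables defined earlier, and the relative file path construction is illustrative:

# Validate a single submission file locally before opening a pull request.
# file_path is given relative to the hub's model output directory.
hubValidations::validate_submission(
  hub_path  = hub_path,
  file_path = file.path(model_id, file_name)
)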
The importance of a stable model output file schema#
NOTE
The following discussion addresses two different types of schemas:
hubverse schema—the schema used for validating hub configuration files
Arrow schema—the mapping of model output columns to Arrow data types.
This section concerns parquet files, which encapsulate a schema within the file, but the broader issues have consequences for all output file types.
Model output data are stored as separate files, but we use the hubData package to open them as a single Arrow dataset.7 It is necessary to ensure that all files conform to the same Arrow schema (i.e., share the same column data types) across the hub’s lifetime. When we know that all data types conform to the Arrow schema, we can be sure that a hub can be successfully accessed and is fully queryable across all columns as an Arrow dataset. Thus, additions of new rounds should not change the overall hub schema at a later date (i.e., after submissions have already started being collected).

7 Even if you do not use hubData to read model outputs, uniform schemas are still important if you want to join model output files and do analyses across submissions.
Many common task IDs should have consistent and stable data types because they are validated against the task IDs in the hubverse schema during model submission. However, there are several situations where a single consistent data type cannot be guaranteed, e.g.:
- New rounds introducing changes in custom task ID value data types, which are not covered by the hubverse schema.
- New rounds introducing changes in task IDs covered by the schema but which accept multiple data types (e.g., scenario_id, where both integer and character are accepted, or age_group, where no data type is specified in the hubverse schema).
- Adding new output types, which might introduce output_type_id values of a new data type.
While config file validation will alert hub administrators to discrepancies in task ID value data types across modeling tasks and rounds, modifications that change the overall data type of model output columns after submissions have been collected could cause downstream issues and should be avoided. Changing the overall data type of model output columns can cause a range of issues (in order of increasing severity):

- data type casting being required in downstream analysis code that used to work,
- not being able to filter on columns with data type discrepancies between files before collecting,
- errors when opening hub model output data with popular analytics tools like Arrow, Pandas, and Polars.
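For context, downstream users typically open all of a hub’s model output files at once as a single dataset, which is exactly what breaks when column data types disagree across files. A minimal sketch using hubData (the quantile filter is illustrative):

# Open the hub's model output files as one Arrow dataset and collect a subset.
hub_con <- hubData::connect_hub(hub_path)
quantile_outputs <- hub_con |>
  dplyr::filter(output_type == "quantile") |>
  hubData::collect_hub()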
The output_type_id column data type#
Output types are configured and handled differently than task IDs in the hubverse.
On the one hand, different output types can have output type ID values of varying data types, and adherence to these data types is required by downstream, output type-specific hubverse functionality like ensembling or visualization.
For example, hubs expect double output type ID values for quantile output types but character output type IDs for a pmf output type.
On the other hand, the use of a long format for hubverse model output files requires that these multiple data types be accommodated in a single output_type_id column.
This characteristic makes the output type ID column unique within the model output file in terms of how its data type is determined, configured, and validated.
During submission validation, two checks are performed on the output_type_id column:

1. Subsets of output_type_id column values associated with a given output type are checked to confirm they can be coerced to the correct data type defined in the config for that output type. This check ensures that correct output type-specific downstream data handling is possible.
2. The overall data type of the output_type_id column is checked to confirm it matches the overall hub schema expectation.
Determining the overall output_type_id column data type automatically#
To determine the overall output_type_id
data type, the default behavior is to automatically detect the simplest data type that can encode all output type ID values across all rounds and output types from the config.
The benefit of this automatic detection is that it provides flexibility to the output_type_id
column to adapt to the output types a hub is collecting. For example, a hub that only collects mean
and quantile
output types would, by default, have a double
output_type_id
column.
However, the risk of this automatic detection arises if the hub also starts collecting a pmf
output type after submissions have begun in subsequent rounds. If this happens, it would change the default output_type_id
column data type from double
to character
and cause a conflict between the output_type_id
column data type in older and newer files when trying to open the hub as an arrow
dataset.
Fixing the output_type_id column data type with the output_type_id_datatype property#
To enable hub administrators to configure and communicate the data type of the output_type_id column at a hub level, the hubverse schema allows for an optional output_type_id_datatype property.
This property should be provided at the top level of tasks.json (i.e., sibling to rounds and schema_version) and can take any of the following values: "auto", "character", "double", "integer", "logical", or "Date". It can be used to fix the output_type_id column data type.
{
"schema_version": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v*/tasks-schema.json",
"rounds": [...],
"output_type_id_datatype": "character"
}
If not supplied or if "auto"
is set, the default behavior of automatically detecting the data type from output_type_id
values is used.
This feature gives hub administrators the ability to future-proof the output_type_id column in their model output files: if they are unsure whether they may start collecting an output type that could affect the schema, they can set the column to "character" (the safest data type, which all other values can be encoded as) at the start of data collection.