Model output#
Directory structure#
The model-output directory1 in a modeling hub is required to have the following subdirectory and file structure:

model_id1/
  <round_id1>-<model_id1>.csv (or parquet, etc.)
  <round_id2>-<model_id1>.csv (or parquet, etc.)
model_id2/
  <round_id1>-<model_id2>.csv (or parquet, etc.)
model_id3/
  <round_id1>-<model_id3>.csv (or parquet, etc.)

where model_id = team_abbr-model_abbr.

1 The directory is required, but the name is flexible. You can use a custom directory path by setting the "model_output_dir" property in the admin.json file. More details can be found in the admin.json schema definition.
Expected patterns#
The elements making up model output directory and file names must match the following patterns:

- round_ids must be either ISO formatted dates (YYYY-MM-DD) or any combination of alphanumerics separated by underscores (_).
- model_ids are composed of team_abbr and model_abbr separated by a hyphen (i.e., team_abbr-model_abbr).
- team_abbr and model_abbr may contain only combinations of alphanumerics separated by underscores (_).

Note that file names are also allowed to contain the following compression extension prefixes: .snappy, .gzip, .gz, .brotli, .zstd, .lz4, .lzo, .bz2, e.g. <round_id1>-<model_id1>.gz.parquet.
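For a quick sanity check of a file name against these patterns, a regular expression can help. This is a minimal sketch, not official hubverse tooling; the regex and example file name are illustrative only:

# Illustrative check that a file name follows <round_id>-<model_id> with an
# optional compression prefix and a csv or parquet extension.
file_name <- "2022-10-12-hub-baseline.gz.parquet"
pattern <- paste0(
  "^([0-9]{4}-[0-9]{2}-[0-9]{2}|[A-Za-z0-9_]+)-",    # round_id: ISO date or alphanumerics/underscores
  "[A-Za-z0-9_]+-[A-Za-z0-9_]+",                     # model_id: team_abbr-model_abbr
  "(\\.(snappy|gzip|gz|brotli|zstd|lz4|lzo|bz2))?",  # optional compression prefix
  "\\.(csv|parquet)$"                                # file format extension
)
grepl(pattern, file_name)  # TRUE if the name matches the convention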
Example model output directory structure#
With ISO date round_ids#

hub-baseline/
  2022-10-12-hub-baseline.csv
  2022-10-19-hub-baseline.csv
team_1-ensemble/
  2022-10-12-team_1-ensemble.parquet
  2022-10-19-team_1-ensemble.gz.parquet

With alphanumeric round_ids#

hub-baseline/
  2024_2025_1_covid-hub-baseline.csv
  2024_2025_1_flu-hub-baseline.csv
team_1-ensemble/
  2024_2025_1_covid-team_1-ensemble.parquet
  2024_2025_1_flu-team_1-ensemble.gz.parquet
Example model submission file#
Each model submission file will have the same representation for each hub. Here is an example of a hub that collects mean and quantile forecasts for one-week-ahead incidence, but probabilities for the timing of a season peak:
| origin_epiweek | target | horizon | output_type | output_type_id | value |
|---|---|---|---|---|---|
| EW202242 | weekly rate | 1 | mean | NA | 5 |
| EW202242 | weekly rate | 1 | quantile | 0.25 | 2 |
| EW202242 | weekly rate | 1 | quantile | 0.5 | 3 |
| EW202242 | weekly rate | 1 | quantile | 0.75 | 10 |
| EW202242 | weekly rate | 1 | pmf | 0 | 0.1 |
| EW202242 | weekly rate | 1 | pmf | 0.1 | 0.2 |
| EW202242 | weekly rate | 1 | pmf | 0.2 | 0.7 |
| EW202242 | peak week | NA | pmf | EW202240 | 0.001 |
| EW202242 | peak week | NA | pmf | EW202241 | 0.002 |
| EW202242 | … | … | … | … | … |
| EW202242 | peak week | NA | pmf | EW202320 | 0.013 |
| EW202242 | weekly rate | 1 | sample | 1 | 3 |
| EW202242 | weekly rate | 1 | sample | 2 | 3 |
File formats#
Hubs can take submissions in tabular data formats, namely csv and parquet. These submission formats are not mutually exclusive; hubs may choose between parquet (Arrow), csv, or both. Both formats have advantages and tradeoffs:

Considerations about csv:

- Advantages:
  - Compatibility: files are human-readable and are widely supported by many tools.
- Disadvantages:
  - Size: some projects have run into 100 MB file size limits when using csv formatted files.

Considerations about parquet:

- Advantages:
  - Speed.
  - Size: in combination, splitting files up and using parquet would get around GitHub limits on file sizes.
  - Loads only data that are needed.
- Disadvantages:
  - Compatibility: harder to work with; teams and people who want to work with files need to install additional libraries.
Examples of how to create these file formats in R and Python are listed below in the writing model output section.
Formats of model output#
Reference: Much of the material in this section has been excerpted or adapted from the hubEnsembles manuscript.3

3 Shandross, L., Howerton, E., Contamin, L., Hochheiser, H., Krystalli, A., Consortium of Infectious Disease Modeling Hubs, Reich, N. G., Ray, E. L. (2024). hubEnsembles: Ensembling Methods in R. (under review for publication) (Repo: https://github.com/hubverse-org/hubEnsemblesManuscript).
Model outputs are a specially formatted tabular representation of predictions.
Each row corresponds to a unique prediction, and each column provides information about what is being predicted, its scope, and its value.
Per hubverse convention, there are two groups of columns providing metadata about the prediction,4 followed by a value column with the actual output. Each group of columns serves a specific purpose: (1) the “task ID” columns provide details about what is being predicted, and (2) the two “model output representation” columns specify the type of prediction and identifying information about that prediction. Finally, (3) the value column provides the model output of the prediction. Details about the column specifications can be found below.

4 When using models for downstream analysis with the collect_hub() function in the hubData package, one more column called model_id is prepended that identifies the model from its filename.
Details about model output column specifications#
As shown in the model output submission table above, there are three “task ID” columns: origin_epiweek, target, and horizon; and there are two “model output representation” columns, output_type and output_type_id, followed by the value column.
More detail about each of these column groups is given in the following points:
- “Task IDs” (multiple columns): The details of the outcome (the model task) are provided by the modeler and can be stored in a series of “task ID” columns as described in this section on task ID variables. These “task ID” columns may also include additional information, such as any conditions or assumptions used to generate the predictions. Some example task ID variables include target, location, reference_date, and horizon. Although there are no restrictions on naming task ID variables, we suggest that hubs adopt the standard task ID or column names and definitions specified in the section on usage of task ID variables when appropriate.
- “Model output representation” (2 columns): consists of two columns specifying how the model outputs are represented. Both of these columns will be present in all model output data:
  - output_type specifies the type of representation of the predictive distribution, namely "mean", "median", "quantile", "cdf", "pmf", or "sample".
  - output_type_id specifies more identifying information specific to the output type, which varies depending on the output_type.
- value contains the model’s prediction.
The following table provides more detail on how to configure the three “model output representation” columns based on each model output type.
| output_type | output_type_id | value |
|---|---|---|
| mean | NA (not used) | Numeric: the mean of the predictive distribution |
| median | NA (not used) | Numeric: the median of the predictive distribution |
| quantile | Numeric between 0.0 and 1.0: a probability level | Numeric: the quantile of the predictive distribution at the probability level specified by the output_type_id |
| cdf | String or numeric: a possible value of the target variable | Numeric between 0.0 and 1.0: the value of the cumulative distribution function of the predictive distribution at the value of the outcome variable specified by the output_type_id |
| pmf | String naming a possible category of a discrete outcome variable | Numeric between 0.0 and 1.0: the value of the probability mass function of the predictive distribution when evaluated at a specified level of a categorical outcome variable |
| sample | Positive integer sample index | Numeric: a sample from the predictive distribution |
Note

The model output type IDs have different caveats depending on the output_type:

- mean and median: Point estimates do not have an output_type_id because you can only have one point estimate for each combination of task IDs. However, because the output_type_id column is required, it must still be filled with a missing value, encoded as NA in R and None in Python. See the example on writing parquet files for details.
- pmf: Values are required to sum to 1 across all output_type_id values within each combination of values of task ID variables. This representation should only be used if the outcome variable is truly discrete; a CDF representation is preferred if the categories represent a binned discretization of an underlying continuous variable.
- sample: Depending on the hub specification, samples with the same sample index (specified by the output_type_id) may be assumed to correspond to a single sample from a joint distribution across multiple levels of the task ID variables; further details are discussed below.
- cdf (and pmf for ordinal variables): In the hub’s tasks.json configuration file, the values of the output_type_id should be listed in order from low to high.
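For illustration, here is a minimal sketch in R (with made-up values borrowed from the example table above) of a model output data frame that combines a point estimate with quantile rows; note the NA placeholder in output_type_id for the mean row:

# A minimal sketch with made-up values: one mean row (output_type_id is NA)
# and three quantile rows for a single modeling task.
model_out <- data.frame(
  origin_epiweek = "EW202242",
  target         = "weekly rate",
  horizon        = 1L,
  output_type    = c("mean", "quantile", "quantile", "quantile"),
  output_type_id = c(NA, 0.25, 0.5, 0.75),  # NA marks the unused ID for the point estimate
  value          = c(5, 2, 3, 10)
)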
Writing model output to a hub#
The model output follows the specification of the tasks.json
configuration
file of the hub. If you are creating a model and would like to know what
data type your columns should be in, the Hubverse has utilities to provide an
arrow schema and even a full submission
template from the tasks.json
configuration file.
When submitting model output to a hub, it should be placed in a folder with the
name of your model_id
in the model outputs folder specified by the hub
administrator (this is usually called model-output
). Below are R and Python
examples for writing Hubverse-compliant model output files in both CSV and
parquet format. In these examples, we are assuming the following variables
already exist:
- hub_path is the path to the hub cloned on your local computer
- model_id is the combination of <team_abbr>-<model_abbr>
- file_name is the file name of your model formatted as <round_id>-<model_id>.csv (or .parquet)
- model_out is the tabular output from your model formatted as specified in the formats of model output section.
Submission Template#
The hubverse package hubValidations
has functionality
that will generate template data to get you started. This submission
template can be written as a CSV or parquet file and then imported into
whatever software you use to run your model.
Here is some example code that can help. In this example, hub_path
is the
path to the hub on your local machine.
# read the configuration file and get the latest round
config_tasks <- hubUtils::read_config(hub_path)
rounds <- hubUtils::get_round_ids(config_tasks)
this_round <- rounds[length(rounds)]
# create the submission template (this may take some time if your submission uses samples)
tmpl <- hubValidations::submission_tmpl(config_tasks = config_tasks, round_id = this_round)
You can then either write this template to a csv file with the readr
package:
# write the template to a csv file to use in your model code.
readr::write_csv(tmpl, "/path/to/template.csv")
OR you can write it to a parquet file with the arrow
package:
# write the template to a parquet file to use in your model code.
arrow::write_parquet(tmpl, "/path/to/template.parquet")
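Once the template exists, your modeling code can read it back in, fill in the value column, and write the result out as your submission file. This is a minimal sketch assuming a CSV template at the illustrative path used above:

# Read the template, fill in predictions, and write the submission file.
tmpl <- readr::read_csv("/path/to/template.csv")
# ... replace tmpl$value with your model's predictions ...
readr::write_csv(tmpl, fs::path(hub_path, "model-output", model_id, file_name))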
Example: model output as CSV#
The sections below provide examples for writing CSV model output files. Note that missing data in a CSV file should be either a blank cell (that is, two adjacent commas ,,) or NA without quotes5 (e.g. ,NA,).

5 You can quote me on this: No quotes.
Writing CSV with R#
When writing a model output file in R, use the readr
package.
# ... generate model data ...
outfile <- fs::path(hub_path, "model-output", model_id, file_name)
readr::write_csv(model_out, outfile)
Writing CSV with Python#
This example uses the pandas
package when creating CSV model output files.
import pandas as pd
import os.path
# ... generate model data ...
outfile = os.path.join(hub_path, "model-output", model_id, file_name)
model_out.to_csv(outfile, index = False)
Example: model output as parquet#
Unlike a CSV file, a parquet file contains embedded information about the data types of its columns. Therefore, when writing model output files as parquet, it’s critical that you first ensure the data type of your columns matches the expected type from the Arrow schema.
If the data types of the model output parquet file don’t match the hub’s schema, the submission will not validate. In practice, you will need to know whether the expected data type is a string/character, float/numeric, or int/integer.
Arrow Schema#
The hubverse packages hubData
and hubUtils
have functionality that will generate an
arrow schema so that you can ensure your output matches the expected type.
Here is some example code that can help. In this example, hub_path
is the
path to the hub on your local machine.
# read the configuration file and get the latest round
config_tasks <- hubUtils::read_config(hub_path, "tasks")
schema <- hubData::create_hub_schema(config_tasks)
The schema output will look something like this:
Schema
origin_date: date32[day]
target: string
horizon: int32
location: string
age_group: string
output_type: string
output_type_id: double
value: int32
model_id: string
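To compare your own data against this schema before writing a file, one option (a minimal sketch, not required by the hubverse) is to convert your data frame to an Arrow table and inspect its schema:

# Inspect the Arrow data types of your model output data frame and compare
# them with the hub schema printed above.
arrow_tbl <- arrow::arrow_table(model_out)
arrow_tbl$schema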
Writing parquet with R#
You can use the hubData::coerce_to_hub_schema() function to ensure your data are in the correct format before writing them out.
# ... generate model data ...
outfile <- fs::path(hub_path, "model-output", model_id, file_name)
# coerce model output data to the data types of the hub schema
config_tasks <- hubUtils::read_config(hub_path, "tasks")
model_out <- hubData::coerce_to_hub_schema(model_out, config_tasks)
# write to parquet file
arrow::write_parquet(model_out, outfile)
Writing parquet with Python#
This example uses the pandas package to create parquet files. Importantly, if you are creating a parquet file, you will need to ensure your column types match the hub schema. You can do this by using the astype() method for pandas DataFrames.6

6 If you prefer to use polars for your model output, you would use the polars cast() method.
import pandas as pd
import os.path
# ... generate model data ...
outfile = os.path.join(hub_path, "model-output", model_id, file_name)
# update the output_type_id data type to match the hub's schema
model_out["output_type_id"] = model_out["output_type_id"].astype("float") # or "string", or "Int64"
model_out.to_parquet(outfile)
Model output relationships to task ID variables#
We emphasize that the mean, median, quantile, cdf, and pmf representations all summarize the marginal predictive distribution for a single combination of model task ID variables.
In contrast, we cannot assume the same for the sample representation.
By recording samples from a joint predictive distribution, the sample representation may capture dependence across combinations of multiple model task ID variables.
For example, suppose the model task ID variables are “forecast date”, “location”, and “horizon”. A predictive mean will summarize the predictive distribution for a single combination of forecast date, location, and horizon. On the other hand, there are several options for the distribution from which a sample might be drawn, capturing dependence across different levels of the task ID variables, including:
the joint predictive distribution across all locations and horizons within each forecast date
the joint predictive distribution across all horizons within each forecast date and location
the joint predictive distribution across all locations within each forecast date and horizon
the marginal predictive distribution for each combination of forecast date, location, and horizon
Hubs should specify the collection of task ID variables for which samples are expected to capture dependence; e.g., the first option listed above might specify that samples should be drawn from distributions that are “joint across” locations and horizons.
More details about the sample output type can be found in the page describing sample output type data.
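As an illustration of samples that are “joint across” horizons (the second option above), here is a minimal sketch with made-up task ID names and values; each output_type_id value labels one joint draw spanning both horizons for a given forecast date and location:

# A minimal sketch with made-up values: sample index 1 is one joint draw
# across horizons 1 and 2; sample index 2 is another joint draw.
samples <- data.frame(
  forecast_date  = "2022-10-15",
  location       = "US",
  horizon        = c(1L, 2L, 1L, 2L),
  output_type    = "sample",
  output_type_id = c(1L, 1L, 2L, 2L),
  value          = c(120, 131, 98, 104)
)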
Omitted output types#
Some other possible model output representations have been proposed but not included in the list above. We document these other proposals and the reasons for their omissions here:
Point estimates#
- Bin probability. Two notes:
  - If the bins have open left endpoints and closed right endpoints, bin probabilities can be calculated directly from CDF values.
  - We considered a system with a more flexible specification of bin endpoint inclusion status but noted two disadvantages:
    - This additional flexibility would introduce substantial extra complexity to the metadata specifications and tooling.
    - Adopting limited standards that encourage hubs to adopt settings consistent with the common definitions might be beneficial. For example, if a hub adopted bins with open right endpoints, the resulting probabilities would be incompatible with the conventions around cumulative distribution functions.
- Probability of “success” in a setting with a binary outcome.
  - This can be captured with a CDF representation if the outcome variable is ordered or a categorical representation if the outcome variable is not.
- Compositional. For example, we might request a probabilistic estimate of the proportion of hospitalizations next week due to influenza A/H1, A/H3, and B.
  - Note that a categorical output representation could be used if only point estimates for the composition were required.
Validating prediction values#
Before model outputs can be incorporated into a hub, they must be validated. If a hub is centrally stored on GitHub, validation checks will be automatically performed for each submission (via the validate_pr()
function from the hubValidations
R package).
Teams can also validate their submissions locally via the validate_submission() function from the hubValidations R package, which performs two validation tasks:

1. Validation based on rules that can easily be encoded in the JSON schema, such as ranges of expected values and output_type_ids.
2. Validation of more involved rules that cannot be encoded in a JSON schema and are therefore implemented separately (such as specific relationships between outputs and targets). You can find a table with the details of each check in the validate_submission() documentation and the validate_pr() documentation.
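For example, a team could run something like the following from R before opening a pull request. This is a minimal sketch; it reuses the hub_path, model_id, and file_name variables defined earlier, and the relative file path construction is illustrative:

# Validate a single submission file locally before opening a pull request.
# file_path is given relative to the hub's model output directory.
hubValidations::validate_submission(
  hub_path  = hub_path,
  file_path = file.path(model_id, file_name)
)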
The importance of a stable model output file schema#
NOTE
The following discussion addresses two different types of schemas:
hubverse schema—the schema used for validating hub configuration files
Arrow schema—the mapping of model output columns to Arrow data types.
This section concerns parquet files, which encapsulate a schema within the file, but the broader issues have consequences for all output file types.
Model output data are stored as separate files, but we use the hubData package to open them as a single Arrow dataset.7 It is necessary to ensure that all files conform to the same Arrow schema (i.e., share the same column data types) across the hub’s lifetime. When we know that all data types conform to the Arrow schema, we can be sure that a hub can be successfully accessed and is fully queryable across all columns as an Arrow dataset. Thus, additions of new rounds should not change the overall hub schema at a later date (i.e., after submissions have already started being collected).

7 Even if you do not use hubData to read model outputs, uniform schemas are still important if you want to join model output files and do analyses across submissions.
Many common task IDs should have consistent and stable data types because they are validated against the task IDs in the hubverse schema during model submission. However, there are several situations where a single consistent data type cannot be guaranteed, e.g.:
- New rounds introducing changes in custom task ID value data types, which are not covered by the hubverse schema.
- New rounds introducing changes in task IDs covered by the schema but which accept multiple data types (e.g., scenario_id, where both integer and character are accepted, or age_group, where no data type is specified in the hubverse schema).
- Adding new output types, which might introduce output_type_id values of a new data type.
While config file validation will alert hub administrators to discrepancies in task ID value data types across modeling tasks and rounds, modifications that change the overall data type of model output columns after submissions have been collected could cause downstream issues and should be avoided. Changing the overall data type of model output columns can cause a range of issues (in order of increasing severity):

- data type casting being required in downstream analysis code that used to work,
- not being able to filter on columns with data type discrepancies between files before collecting,
- errors when opening hub model output data with popular analytics tools like Arrow, Pandas, and Polars.
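For context, downstream users typically open all of a hub’s model output files at once as a single dataset, which is exactly what breaks when column data types disagree across files. A minimal sketch using hubData (the quantile filter is illustrative):

# Open the hub's model output files as one Arrow dataset and collect a subset.
hub_con <- hubData::connect_hub(hub_path)
quantile_outputs <- hub_con |>
  dplyr::filter(output_type == "quantile") |>
  hubData::collect_hub()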
The output_type_id column data type#
Output types are configured and handled differently than task IDs in the hubverse.
On the one hand, different output types can have output type ID values of varying data types, and adherence to these data types is required by downstream, output type-specific hubverse functionality like ensembling or visualization.
For example, hubs expect double output type ID values for quantile output types but character output type IDs for a pmf output type.
On the other hand, the use of a long format for hubverse model output files requires that these multiple data types be accommodated in a single output_type_id column.
This characteristic makes the output type ID column unique within the model output file in terms of how its data type is determined, configured, and validated.
During submission validation, two checks are performed on the output_type_id column:

1. Subsets of output_type_id column values associated with a given output type are checked to confirm they can be coerced to the correct data type defined in the config for that output type. This check ensures that correct output type-specific downstream data handling is possible.
2. The overall data type of the output_type_id column is checked to confirm it matches the overall hub schema expectation.
Determining the overall output_type_id column data type automatically#
To determine the overall output_type_id
data type, the default behavior is to automatically detect the simplest data type that can encode all output type ID values across all rounds and output types from the config.
The benefit of this automatic detection is that it provides flexibility to the output_type_id
column to adapt to the output types a hub is collecting. For example, a hub that only collects mean
and quantile
output types would, by default, have a double
output_type_id
column.
However, the risk of this automatic detection arises if the hub also starts collecting a pmf
output type after submissions have begun in subsequent rounds. If this happens, it would change the default output_type_id
column data type from double
to character
and cause a conflict between the output_type_id
column data type in older and newer files when trying to open the hub as an arrow
dataset.
Fixing the output_type_id column data type with the output_type_id_datatype property#
To enable hub administrators to configure and communicate the data type of the output_type_id column at a hub level, the hubverse schema allows for an optional output_type_id_datatype property.
This property should be provided at the top level of tasks.json (i.e., sibling to rounds and schema_version) and can take any of the following values: "auto", "character", "double", "integer", "logical", or "Date". It can be used to fix the output_type_id column data type.
{
"schema_version": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v*/tasks-schema.json",
"rounds": [...],
"output_type_id_datatype": "character"
}
If not supplied or if "auto"
is set, the default behavior of automatically detecting the data type from output_type_id
values is used.
This feature gives hub administrators the ability to future-proof the output_type_id column in their model output files: if they are unsure whether they may start collecting an output type that could affect the schema, they can set the column to "character" (the safest data type, which all other values can be encoded as) at the start of data collection.