Large correlation matrix data require excessive CPU and memory resources when browsing

Thanks for the interesting questions. Unfortunately, the situation has not changed since I answered similar questions in an email to you last year (12th June 2020). The current heatmap visualisation code does not cope well with tables of more than, say, 5000 rows, corresponding to a covariance matrix with 50–100 bins per axis. Suppressing the loading of large tables by default ([HEPData/hepdata#136](https://github.com/HEPData/hepdata/issues/136)) and investigating alternative (possibly more efficient) visualisation libraries such as Vega-Lite or Altair ([HEPData/hepdata#151](https://github.com/HEPData/hepdata/issues/151)) are long-standing open issues that are not easy to address. Conversion of large matrices from YAML into other formats may also be problematic.

I agree that the current YAML encoding of covariance matrices is quite verbose, since the bin definitions are duplicated for every matrix element, but it has the advantage of using the same representation as for two-dimensional measurements with different bins for each of the two independent variables. Uploading a custom YAML format is not currently possible.
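For concreteness, the verbose encoding looks like the following for a 2-bin measurement (variable names, units, and numbers here are purely illustrative). Each of the N² matrix elements repeats its pair of bin definitions, which is where the file size blows up for large N:

```yaml
independent_variables:
- header: {name: 'p_T (bin 1)', units: 'GeV'}
  values:
  - {low: 0.0, high: 10.0}
  - {low: 0.0, high: 10.0}
  - {low: 10.0, high: 20.0}
  - {low: 10.0, high: 20.0}
- header: {name: 'p_T (bin 2)', units: 'GeV'}
  values:
  - {low: 0.0, high: 10.0}
  - {low: 10.0, high: 20.0}
  - {low: 0.0, high: 10.0}
  - {low: 10.0, high: 20.0}
dependent_variables:
- header: {name: 'Covariance'}
  values:
  - {value: 1.2}
  - {value: 0.3}
  - {value: 0.3}
  - {value: 0.9}
```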

The current solution to both of these problems is to include a large covariance matrix not as a data table, but as additional_resources attached either to the whole submission or to a specific table. In the latter case, the table could contain the central values of the measurement in the normal YAML format, with the covariance matrix attached as additional_resources; alternatively, an empty table could be given, with {independent_variables: [], dependent_variables: []}, and the covariance matrix attached as additional_resources. With this method you are free to specify the covariance matrix in any format you choose, for example, a more concise YAML format or a simple text format such as CSV. The hepdata_lib tool provides an add_additional_resource method on instances of the Submission and Table classes.
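As a minimal sketch of the CSV option, using only the Python standard library (the helper name and file name are my own invention, not part of any HEPData tooling), one could write one row per bin rather than one row per matrix element:

```python
import csv

def write_covariance_csv(path, bin_edges, cov):
    """Write a symmetric covariance matrix to a concise CSV file.

    bin_edges: list of (low, high) tuples labelling the bins
    cov:       nested list with cov[i][j] = covariance of bins i and j
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        # One header row, then one row per bin (not one row per matrix element).
        writer.writerow(["low", "high"] + [f"bin{j}" for j in range(len(cov))])
        for (low, high), row in zip(bin_edges, cov):
            writer.writerow([low, high, *row])

# Example: a 2x2 covariance matrix for two illustrative p_T bins
write_covariance_csv("covariance.csv",
                     [(0.0, 10.0), (10.0, 20.0)],
                     [[1.2, 0.3], [0.3, 0.9]])
```

The resulting file could then be attached to a hepdata_lib Table or Submission with add_additional_resource, passing a short description and the file location (check the hepdata_lib documentation for the exact signature of your installed version).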

I realise that this solution is not ideal, but HEPData was not designed to support large files or custom formats, and it is not easy to extend the current software. If you manage to develop a concise YAML format for covariance matrices, similar in spirit to the HEPData format, we could consider supporting it in a future software release; in the meantime, you can attach the files as additional_resources.
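As a straw-man illustration of what such a concise format might look like (entirely hypothetical, not supported by HEPData), the bins could be listed once and the matrix stored as nested lists:

```yaml
# Hypothetical concise encoding -- NOT a supported HEPData format
bins:
- {low: 0.0, high: 10.0}
- {low: 10.0, high: 20.0}
covariance:   # row-major, one list per bin
- [1.2, 0.3]
- [0.3, 0.9]
```

This stores N bin definitions plus N² values, instead of repeating two bin definitions alongside each of the N² values as in the current table format.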