Large correlation matrix data require excessive CPU and memory resources when browsing

I am working on a submission of a BaBar measurement of sigma(e+e- → pi+pi-) that includes a large (337 x 337) statistical covariance matrix. Preparing and uploading the submission requires few CPU resources, but when the table is displayed in a browser, even a modern 8-thread CPU with 16 GB of RAM is overloaded for a significant amount of time, apparently while producing and displaying a scatter plot of the covariance matrix. I work on Linux Fedora 34 and use Chrome as my browser.

Is there a way to build a submission such that the automatic plot of a particular table is disabled when browsing?

Although this poses no resource constraints, I also note that the YAML format for storing a covariance matrix is verbose: in my case, each coefficient must be stored together with the lower and upper edges of both its row bin and its column bin. A more compact format could store all the coefficients without extra data, plus a single copy of the bin edges, for example as sketched below. I have read the documentation but did not find an example of how to include data in a custom format: is that possible?
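A hypothetical compact encoding along these lines, shown here for a 2 x 2 matrix with made-up numbers (this is not a format that HEPData currently accepts):

```yaml
# Hypothetical compact encoding (not an existing HEPData format).
# The N+1 bin edges are stored once and shared by rows and columns;
# the N x N coefficients follow, one matrix row per line.
bin_edges: [0.300, 0.305, 0.310]
covariance:
- [1.2e-05, 3.4e-06]
- [3.4e-06, 2.1e-05]
```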

Best regards,

Alberto Lusiani

Thanks for the interesting questions. Unfortunately, the situation has not changed since I answered similar questions in an email to you last year (12th June 2020). The current heatmap visualisation code does not cope well with tables of more than, say, 5000 rows, corresponding to a covariance matrix with 50-100 bins. Suppressing the loading of large tables by default (HEPData/hepdata#136: "records: initially only load and plot the first 50 rows of large tables") and investigating alternative, possibly more efficient, visualisation libraries (HEPData/hepdata#151: "plots: investigate use of Vega-Lite or Altair for visualization") are long-standing open issues that are not easy to address. Conversion of large matrices from YAML into other formats may also present a problem.

I agree that the current YAML encoding of covariance matrices is quite verbose, with duplicated bins, but it has the advantage of using the same representation for two-dimensional measurements with different bins for each of the two independent variables. Using a custom YAML format is not currently possible.
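To make the duplication concrete, here are the first two elements of a covariance matrix in the current HEPData YAML encoding (an illustrative excerpt; the variable name and numbers are invented):

```yaml
# Each table row carries the edges of both the row bin and the
# column bin, so the edges are repeated for every matrix element.
independent_variables:
- header: {name: 'SQRT(S)', units: 'GeV'}
  values:
  - {low: 0.300, high: 0.305}
  - {low: 0.300, high: 0.305}
- header: {name: 'SQRT(S)', units: 'GeV'}
  values:
  - {low: 0.300, high: 0.305}
  - {low: 0.305, high: 0.310}
dependent_variables:
- header: {name: 'Covariance'}
  values:
  - {value: 1.2e-05}
  - {value: 3.4e-06}
```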

The current solution to both of these problems is to include a large covariance matrix not as a data table but as additional_resources attached either to the whole submission or to a specific table. In the latter case, the table could contain the central values of the measurement in the normal YAML format, with the covariance matrix attached as additional_resources; alternatively, an empty table could be given with {independent_variables: [], dependent_variables: []} and the covariance matrix attached as additional_resources. With this method you are free to specify the covariance matrix in any format you choose, for example a more concise YAML format or a simple text format such as CSV. The hepdata_lib tool provides a method add_additional_resource for instances of the Submission or Table classes, as in the sketch below.
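A minimal hepdata_lib sketch of the empty-table option; the table name, description, and the file name covariance.csv are placeholders, and I assume a table with no variables is written out as the empty-table form given above:

```python
from hepdata_lib import Submission, Table

submission = Submission()

# An empty table: with no variables added, it should come out as
# {independent_variables: [], dependent_variables: []}.
table = Table("Statistical covariance matrix")
table.description = (
    "337 x 337 statistical covariance matrix, "
    "provided as an attached CSV file.")

# Attach the matrix in a custom format (here CSV) as an additional
# resource of this specific table; copy_file=True copies the local
# file into the output directory alongside the YAML files.
table.add_additional_resource(
    "Statistical covariance matrix in CSV format",
    "covariance.csv",
    copy_file=True,
)

submission.add_table(table)
submission.create_files("output")
```

Calling add_additional_resource on the Submission object instead would attach the file to the whole submission rather than to one table.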

I realise that this solution is not ideal, but HEPData was not designed to support large files or custom formats, and it is not easy to extend the current software. However, if you manage to develop a concise YAML format for covariance matrices in the spirit of the HEPData format, we could consider extending the software to support it in the future; for now, you can attach the files as additional_resources.

Thanks for the reply! I will try to use additional_resources.