By Amanda Fernández-Fontelo, Alejandra Cabaña, Argimiro Arratia, David Moriña, Pere Puig (HU-UAB-UPC)



Further information through https://github.com/underreported/COVID19_UR


The present outbreak of COVID-19 disease, caused by the SARS-CoV-2 virus, has put the planet in quarantine. On January 30, 2020, the World Health Organization (WHO) declared the COVID-19 outbreak a “public health emergency of international concern”, and then a pandemic on March 11.

Spain has become the country with the fifth-highest number of infected cases worldwide, officially registering thousands of cases in a short time. Although many critical and severe measures have been considered by the authorities to lessen the impact of the outbreak and help flatten the curve, they rely on numbers that could be unreliable and may therefore misrepresent the implications of the pandemic.

Because of the protocols used for testing, counts in Spain mainly include individuals with severe symptoms. The authorities have just announced a new protocol with rapid tests, to be implemented in a few days (elpais.com).

Given the nature of our data, we can guess that the estimated cases we are finding are in fact potentially severe cases, and presumably the size of the infected population (including asymptomatic individuals) is even higher.

Accordingly, the current analysis aims to update the COVID-19 situation daily and, in particular, to quantify the potential under-reporting in the officially registered cases by region in Spain. The results can help provide a more realistic picture of the pandemic in real time, as well as more accurate estimates of essential measures such as the basic reproduction number or the fatality rate, which practitioners and politicians use to make decisions.

The data for the analysis have been extracted from eldiario.es, where official data are gathered.

Notice that this analysis can be easily reproduced for other countries.

Table 1: Summary of the daily COVID-19 cases from 27-02-2020 to 19-03-2020 by region in Spain
minimum mean median maximum standard deviation dispersion index
Andalucia 0 45.82 12.5 176 61.99 83.86
Aragon 0 12.82 5.0 67 18.37 26.34
Asturias 0 13.27 2.0 50 17.62 23.39
Baleares 0 7.68 1.5 57 15.19 30.02
Canarias 0 10.00 5.5 39 11.81 13.95
Cantabria 0 3.77 0.0 20 5.92 9.29
Castilla La Mancha 0 43.95 8.0 166 58.25 77.20
Castilla Leon 0 39.45 9.5 237 64.78 106.37
Extremadura 0 10.95 1.0 47 15.98 23.31
Galicia 0 20.59 2.0 112 31.25 47.43
La Rioja 0 21.27 17.5 64 20.54 19.84
Murcia 0 7.59 2.5 45 11.54 17.55
Navarra 0 21.91 1.5 96 31.03 43.93
Pais Vasco 0 67.00 38.0 217 79.95 95.41
Valencia 0 54.55 7.0 279 93.26 159.46
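
The dispersion index in Table 1 appears to be the variance-to-mean ratio of the daily counts (for Canarias, 11.81² / 10.00 ≈ 13.95, which agrees with the tabulated value). A minimal sketch of that calculation, assuming the sample variance is used; the function name and the example counts are hypothetical:

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by sample mean.

    Values far above 1 indicate over-dispersion relative to a Poisson
    distribution with constant rate.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return variance(counts) / m


# Hypothetical daily counts, for illustration only (not taken from the data set).
example_counts = [0, 0, 1, 2, 5, 9, 20, 41]
print(round(dispersion_index(example_counts), 2))
```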

If under-reporting is ignored, the daily counts can be appropriately modeled by \(\exp(\alpha_0 + \alpha_1 t)\), since the number of daily COVID-19 cases grows exponentially over time, as Figure 1 shows. At the moment, there is no evidence of seasonal behaviour of the SARS-CoV-2 virus, unlike MERS-CoV (Alkhamis, Fernández-Fontelo et al., 2018).
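
As an illustration of how such an exponential trend could be fitted, the sketch below estimates \(\alpha_0\) and \(\alpha_1\) by Poisson maximum likelihood with Newton steps. The Poisson assumption, the time index starting at 0, the function names and the example counts are all assumptions made for this sketch; the text does not state how the authors fitted this model.

```python
import math


def ols_line(xs, ys):
    """Ordinary least-squares intercept and slope of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_exponential_trend(counts, max_iter=50, tol=1e-10):
    """Fit mean_t = exp(a0 + a1 * t), t = 0, 1, ..., by Poisson maximum likelihood."""
    times = list(range(len(counts)))
    # Starting values: straight-line fit to log(count + 0.5), which tolerates zeros.
    a0, a1 = ols_line(times, [math.log(y + 0.5) for y in counts])
    for _ in range(max_iter):
        mu = [math.exp(a0 + a1 * t) for t in times]
        # Gradient of the Poisson log-likelihood with respect to (a0, a1).
        g0 = sum(y - m for y, m in zip(counts, mu))
        g1 = sum(t * (y - m) for t, y, m in zip(times, counts, mu))
        # Fisher information matrix (2 x 2, symmetric).
        i00 = sum(mu)
        i01 = sum(t * m for t, m in zip(times, mu))
        i11 = sum(t * t * m for t, m in zip(times, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2 x 2 linear system by Cramer's rule.
        step0 = (i11 * g0 - i01 * g1) / det
        step1 = (i00 * g1 - i01 * g0) / det
        a0, a1 = a0 + step0, a1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    return a0, a1


# Hypothetical counts for illustration only (not taken from Table 1).
alpha0, alpha1 = fit_exponential_trend([0, 1, 0, 2, 3, 5, 9, 13, 22, 35])
print(alpha0, alpha1, "doubling time (days):", math.log(2) / alpha1)
```

The sketch has no step-size safeguard, so it is only suitable for well-behaved series; a statistics package with a Poisson regression routine would be the more robust choice in practice.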

However, if we consider that the official number of daily cases does not reflect the total number of cases (e.g., a proportion of the cases is not observed, and thus the data are misreported), the model above does not make any sense, and therefore a more appropriate alternative should be considered.

We base all the subsequent analysis on a model introduced by Fernández-Fontelo, Cabaña et al. (2016). We have also applied a similar methodology in Fernández-Fontelo, Cabaña et al. (2019) and in other papers submitted for publication (Moriña, Fernández-Fontelo et al. (2020a) and Moriña, Fernández-Fontelo et al. (2020b)).

In that model, two different processes are considered: \(X_n\), the true but unobserved (latent) process, and \(Y_n\), which is observed and potentially under-reported. In this application, the latent process is assumed to be Poisson distributed with time-dependent rate \(\lambda_t=\exp(\beta_0 + \beta_1 t)\). Because of the under-reporting, the observed process is always less than or equal to the latent process: \(Y_n\) equals \(X_n\) (no under-reporting) with probability \(1-\omega\), and \(Y_n\) equals \(q \circ X_n\) with probability \(\omega\). The parameters \(\omega\) and \(q\) quantify the overall frequency and intensity of the phenomenon, respectively: roughly speaking, \(\omega\) describes how often the observed counts differ from the real ones, and \(q\) describes the distance between the real and observed processes.
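
To make the mechanism concrete, the following sketch simulates a latent series and its under-reported counterpart. It rests on several assumptions that the text does not spell out: \(q \circ X_n\) is read as binomial thinning (each latent case is reported independently with probability \(q\)), the latent counts and the under-reporting indicators are drawn independently from day to day, and time starts at 0. All function names and parameter values are hypothetical.

```python
import math
import random


def poisson_draw(rate, rng):
    """Knuth's multiplication method; adequate for moderate rates (well below 700)."""
    threshold = math.exp(-rate)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def binomial_thinning(q, x, rng):
    """q o x: keep each of the x latent cases independently with probability q."""
    return sum(1 for _ in range(x) if rng.random() < q)


def simulate_underreported(beta0, beta1, omega, q, n_days, seed=0):
    """Return (latent, observed) daily counts under the assumptions stated above."""
    rng = random.Random(seed)
    latent, observed = [], []
    for t in range(n_days):
        x = poisson_draw(math.exp(beta0 + beta1 * t), rng)
        # With probability omega the day is under-reported; otherwise Y = X.
        y = binomial_thinning(q, x, rng) if rng.random() < omega else x
        latent.append(x)
        observed.append(y)
    return latent, observed


# Illustrative parameter values only (not estimates from Table 2).
latent, observed = simulate_underreported(0.3, 0.2, omega=0.7, q=0.5, n_days=20)
for day, (x, y) in enumerate(zip(latent, observed)):
    print(day, x, y)
```

By construction the observed count never exceeds the latent one, and setting `omega=0` reproduces the latent series exactly.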

Table 2: Estimates of under-reporting parameters by region in Spain
\(\beta_0\) \(\beta_1\) \(\omega\) \(q\) AIC
Andalucia 0.3279 0.2479 0.8572 0.6122 212.7
s.e. (Andalucia) 0.1728 0.0087 0.0883 0.0211
Aragon -0.3013 0.2046 0.3709 0.2282 166.5
s.e. (Aragon) 0.33 0.0175 0.1347 0.0526
Asturias -0.3282 0.2051 0.4396 0.2662 134.1
s.e. (Asturias) 0.3922 0.0202 0.1492 0.0684
Baleares -0.7201 0.2237 0.8452 0.2877 120
s.e. (Baleares) 0.6296 0.0294 0.1078 0.0559
Canarias 0.0592 0.1659 0.3799 0.284 121.6
s.e. (Canarias) 0.4031 0.021 0.1475 0.0848
Cantabria 1.8292 0.0344 0.6844 0.0411 101.5
s.e. (Cantabria) 0.4494 0.0244 0.1019 0.022
Castilla La Mancha -0.1119 0.2631 0.5525 0.478 179.6
s.e. (Castilla La Mancha) 0.1969 0.0102 0.1294 0.0261
Castilla Leon -0.6833 0.2945 0.7692 0.5552 164.8
s.e. (Castilla Leon) 0.2506 0.0121 0.1235 0.0286
Extremadura 0.0904 0.173 0.518 0.0471 108.7
s.e. (Extremadura) 0.4445 0.0229 0.114 0.0292
Galicia -0.8934 0.2656 0.5915 0.4145 171.8
s.e. (Galicia) 0.3198 0.0162 0.1364 0.0478
La Rioja 2.1865 0.0921 0.5259 0.2519 208.8
s.e. (La Rioja) 0.2126 0.0118 0.1116 0.0357
Murcia -2.2174 0.2724 0.1708 0.2922 87.9
s.e. (Murcia) 0.4695 0.024 0.1459 0.1265
Navarra -0.7687 0.2661 0.7357 0.5638 186.3
s.e. (Navarra) 0.262 0.0133 0.1173 0.0353
Pais Vasco 1.2253 0.2143 0.7241 0.6253 260.2
s.e. (Pais Vasco) 0.141 0.0072 0.1089 0.0196
Valencia 4.4696 0.0447 0.7727 0.0572 377.4
s.e. (Valencia) 0.2823 0.014 0.0893 0.0096

Using the Viterbi algorithm, the model also makes it possible to reconstruct the most likely sequence of real COVID-19 cases over the study period. This gives us an estimated time series of true daily cases and lets us evaluate the impact of under-reporting on measures such as the basic reproduction number. Figure 2 shows the observed and reconstructed series over time by region.
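
For readers unfamiliar with the Viterbi algorithm, the sketch below shows its generic form for a small discrete hidden Markov model, working in log-space. It is not the authors' implementation: in their model the hidden states are latent case counts, whereas the toy chain, its probabilities and the function name here are purely hypothetical.

```python
import math


def viterbi(observations, start_prob, trans_prob, emit_prob):
    """Most likely hidden-state path of a discrete hidden Markov model.

    start_prob[s]      probability of starting in state s
    trans_prob[r][s]   probability of moving from state r to state s
    emit_prob[s][o]    probability of observing symbol o in state s
    """
    n_states = len(start_prob)

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # score[s]: best log-probability of any path that ends in state s.
    score = [log(start_prob[s]) + log(emit_prob[s][observations[0]])
             for s in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + log(trans_prob[r][s]))
            new_score.append(score[best_prev] + log(trans_prob[best_prev][s])
                             + log(emit_prob[s][obs]))
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical two-state chain with three possible observation symbols.
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
print(viterbi([0, 1, 2], start, trans, emit))
```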

Table 3 shows the percentage of the mean counts that is not covered by the official registers. Thus, the higher the percentage, the lower the coverage, and therefore the more severe the impact of the under-reporting.

Table 3: Estimated mean non-coverage of COVID-19 cases in Spain
observed mean true mean % not covered
Andalucia 29.1579 46.0000 36.61
Aragon 9.2105 19.5789 52.96
Asturias 9.3158 11.7368 20.63
Canarias 6.2632 7.8421 20.13
Cantabria 3.0526 9.0526 66.28
Castilla Leon 17.5789 23.4211 24.94
Catalunya 47.5263 57.8421 17.83
Extremadura 5.8421 9.0000 35.09
Galicia 12.8947 19.5789 34.14
La Rioja 16.4211 27.6842 40.68
Madrid 219.2105 306.0526 28.37
Navarra 14.4211 18.3158 21.26
Pais Vasco 48.1053 60.0000 19.82
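
The last column of Table 3 is consistent with the relative gap between the reconstructed ("true") mean and the observed mean, expressed as a percentage of the true mean. A short worked check using the Andalucia row; the function name is hypothetical:

```python
def percent_not_covered(observed_mean, true_mean):
    """Share of the reconstructed mean daily count missing from the official mean, in percent."""
    return 100.0 * (true_mean - observed_mean) / true_mean


# Means for Andalucia as listed in Table 3.
print(round(percent_not_covered(29.1579, 46.0000), 2))  # 36.61
```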

Modeling the epidemic with the under-reported data

It is instructive to see how the estimated epidemic spread differs when an epidemic model is fitted to the reconstructed series of counts rather than to the observed counts recorded by public agencies. We fit the classic SIR (Susceptible-Infectious-Recovered) model. Table 4 shows the basic reproduction number obtained from the reconstructed series (\(R_{0E}\)) and from the observed series (\(R_{0R}\)).

The dynamics of the spread of the virus in the SIR model is described by the following differential equations:

\(\frac{dS}{dt} = -\beta \frac{IS}{N} \)

\(\frac{dI}{dt} = \beta \frac{IS}{N}- \gamma I \)

\(\frac{dR}{dt} = \gamma I \)

where \(\beta\) is the infection rate, \(\gamma\) is the recovery rate, and \(N\) is the total population.

We seek the values of \(\beta\) and \(\gamma\) that minimize the residual sum of squares (RSS) between the number of infected individuals and the corresponding number of cases predicted by the model at each time point.

Once the values of \(\beta\) and \(\gamma\) are known, we can compute the basic reproduction number \(R_0=\beta/\gamma\). This number gives an estimate of the average number of susceptible individuals who are infected by each infected individual.
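
The fitting procedure can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the ODEs are integrated with a fixed-step classical Runge-Kutta scheme, the supplied series is compared directly with the modeled infectious compartment \(I(t)\), the first observation is taken as the initial number of infectious individuals, and the minimization is a coarse grid search standing in for a proper numerical optimizer. All function names, the grid and the synthetic series are hypothetical.

```python
def sir_rhs(state, beta, gamma, n_pop):
    """Right-hand side of the SIR equations for state = (S, I, R)."""
    s, i, _ = state
    new_infections = beta * i * s / n_pop
    recoveries = gamma * i
    return (-new_infections, new_infections - recoveries, recoveries)


def simulate_infected(beta, gamma, n_pop, i0, n_days, steps_per_day=10):
    """Integrate the SIR model with classical RK4; return I(t) for t = 0..n_days-1."""
    h = 1.0 / steps_per_day
    state = (n_pop - i0, float(i0), 0.0)
    infected = [state[1]]

    def shifted(base, slope, factor):
        return tuple(b + factor * k for b, k in zip(base, slope))

    for _day in range(n_days - 1):
        for _step in range(steps_per_day):
            k1 = sir_rhs(state, beta, gamma, n_pop)
            k2 = sir_rhs(shifted(state, k1, h / 2), beta, gamma, n_pop)
            k3 = sir_rhs(shifted(state, k2, h / 2), beta, gamma, n_pop)
            k4 = sir_rhs(shifted(state, k3, h), beta, gamma, n_pop)
            state = tuple(
                x + h / 6 * (a + 2 * b + 2 * c + d)
                for x, a, b, c, d in zip(state, k1, k2, k3, k4)
            )
        infected.append(state[1])
    return infected


def rss(beta, gamma, observed, n_pop):
    """Residual sum of squares between the observed series and the modeled I(t)."""
    predicted = simulate_infected(beta, gamma, n_pop, observed[0], len(observed))
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def fit_sir(observed, n_pop, grid):
    """Grid search for the (beta, gamma) pair with the smallest RSS."""
    candidates = ((b, g) for b in grid for g in grid)
    return min(candidates, key=lambda bg: rss(bg[0], bg[1], observed, n_pop))


def basic_reproduction_number(beta, gamma):
    """R0 = beta / gamma; infinite when the fitted recovery rate is zero."""
    return float("inf") if gamma == 0 else beta / gamma


# Synthetic series generated from known parameters, for illustration only.
synthetic = simulate_infected(0.6, 0.4, n_pop=1000.0, i0=5, n_days=30)
beta_hat, gamma_hat = fit_sir(synthetic, 1000.0, [k / 10 for k in range(1, 11)])
print(beta_hat, gamma_hat, basic_reproduction_number(beta_hat, gamma_hat))
```

A fitted \(\gamma\) of zero yields an infinite \(R_0\), which is how a row such as Andalucia in Table 4 can show `Inf`.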

Table 4: Basic reproduction numbers
\(\beta_E\) \(\gamma_E\) \(R_{0E}\) \(\beta_R\) \(\gamma_R\) \(R_{0R}\)
Andalucia 0.3036 0.0000 Inf 0.6382 0.3618 1.7640
Aragon 0.5000 0.5000 1.0000 0.5000 0.5000 1.0000
Asturias 0.5000 0.5000 1.0000 0.5000 0.5000 1.0000
Canarias 0.5314 0.4686 1.1341 0.5213 0.4787 1.0890
Cantabria 0.5685 0.4315 1.3174 0.5000 0.5000 1.0000
Castilla Leon 0.5000 0.5000 1.0000 0.5000 0.5000 1.0000
Catalunya 0.6215 0.3785 1.6420 0.6189 0.3811 1.6243
Extremadura 0.5000 0.5000 1.0000 0.5000 0.5000 1.0000
Galicia 0.5000 0.5000 1.0000 0.5000 0.5000 1.0000
La Rioja 1.0000 0.8147 1.2274 0.5000 0.5000 1.0000
Navarra 0.5000 0.5000 1.0000 0.5000 0.5000 1.0000
Pais Vasco 0.5000 0.5000 1.0000 0.5000 0.5000 1.0000
Madrid 1.0000 0.6685 1.4957 0.6509 0.3490 1.8648