Different coefficients have been suggested in the literature to quantify the agreement between two observers on a nominal scale (Gwet 2012; Hsu and Field 2003; Krippendorff 2004; Warrens 2010a). The most commonly used coefficient is Cohen's Kappa (Cohen 1960; Crewson 2005; Fleiss et al. 2003; Sim and Wright 2005; Gwet 2012; Warrens 2015). An alternative to Kappa is the B coefficient proposed by Bangdiwala (Bangdiwala 1985; Muñoz and Bangdiwala 1997; Shankar and Bangdiwala 2008). The B coefficient can be derived from a graphical representation called the agreement chart: it is defined as the ratio of the sum of the areas of the squares representing perfect agreement to the sum of the areas of the rectangles formed by the marginal totals of the chart.

Based on the conclusions presented in this paper, it is possible to offer some reflections on research policy. The first concerns research evaluation in Italy. The results of the trials cannot in any way be considered a validation of the dual evaluation method used by ANVUR. In the current state of knowledge, it cannot be ruled out that the use of the dual method introduced large, uncontrollable distortions in the final conclusions of the evaluations. Indeed, bibliometrics and peer review show little convergence.
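Agreement of this kind can be quantified with the coefficients introduced above. A minimal sketch, using an invented 3×3 contingency table (the counts are purely illustrative, not taken from the VQR data):

```python
# Hypothetical 3x3 agreement table for two raters on a nominal scale;
# n[i][j] = number of items rater 1 put in category i and rater 2 in j.
n = [
    [20, 5, 2],
    [4, 15, 3],
    [1, 2, 8],
]

k = len(n)
total = sum(sum(row) for row in n)
row_tot = [sum(row) for row in n]
col_tot = [sum(n[i][j] for i in range(k)) for j in range(k)]

# Cohen's kappa: observed agreement corrected for chance agreement.
p_obs = sum(n[i][i] for i in range(k)) / total
p_exp = sum(row_tot[i] * col_tot[i] for i in range(k)) / total ** 2
kappa = (p_obs - p_exp) / (1 - p_exp)

# Bangdiwala's B: sum of the areas of the perfect-agreement squares
# over the sum of the areas of the marginal rectangles in the agreement chart.
B = sum(n[i][i] ** 2 for i in range(k)) / sum(
    row_tot[i] * col_tot[i] for i in range(k)
)

print(round(kappa, 3), round(B, 3))  # → 0.556 0.529
```

Both statistics equal 1 under perfect agreement and fall toward 0 as the diagonal counts shrink relative to the marginals, but they weight disagreement differently, so they need not coincide on the same table.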

In particular, data from the official research reports [12, 17] show that peer-review scores were on average lower than bibliometric scores. Unbiased results at the aggregate level would be achieved only if the distribution of articles evaluated by the two methods were homogeneous across the different evaluation units (research areas, departments, and universities). According to the official reports, the distribution was not homogeneous. The share of articles with an inconclusive bibliometric score, which were therefore peer-reviewed, ranged across research areas from 0.9% to 26.5% in VQR1 (source: [12, Table 3.5]) and from 0.1% to 19.2% in VQR2 (source: [17, Table 3.5]). Aggregate results for research areas, departments, and universities could therefore be influenced by the proportion of research outputs evaluated by each of the two techniques: the higher the share of outputs evaluated by peer review, the lower the aggregate score. Publicly available data show that, more generally, the average research score is negatively related to the percentage of outputs evaluated by peer review.
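The mechanism behind this bias can be sketched numerically. Assume (the per-method averages here are invented for illustration) that peer review scores lower on average than bibliometrics; then two units with identical quality under each method still receive different aggregate scores if their peer-review shares differ:

```python
# Hypothetical per-method average scores (invented numbers, not VQR data):
# peer review is assumed to score lower on average than bibliometrics.
biblio_mean, peer_mean = 0.8, 0.6

def aggregate(peer_share):
    """Aggregate score as a mixture of the two evaluation methods."""
    return peer_share * peer_mean + (1 - peer_share) * biblio_mean

# Shares taken from the extremes reported for VQR1 (0.9% and 26.5%).
unit_a = aggregate(0.009)
unit_b = aggregate(0.265)

print(round(unit_a, 3), round(unit_b, 3))  # → 0.798 0.747
```

Even though both hypothetical units perform identically under each method, the unit with the larger peer-review share ends up with a lower aggregate score, which is the distortion described above.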