## Warning: package 'knitr' was built under R version 3.4.3
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.3
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Exploratory analysis.

I started the analysis exploring the distribution of p-values across all variables. I checked if there is a clustering of p-values around the significance threshold of p = 0.05. The theory of p-curve predicts that naturally occurring p-values are equally distributed. With the exception that in the presence of an effect of the independent variable, the distribution will grow exponentially as it approaches 0.01.

In the first place, the graph above shows an exponential grow approaching 0.00, which is expected in studies that are exploring a true effect. In the case of a non-effect, it is expected that the distribution of p-values is equal and projects a flat distribution. More interestingly is the fact that it seems to be clustering of p-values around absolute values such as 0.01, 0.02, 0.03, a 0.04., even when the graph show a distribution of p-values with generally a true effect, it is expected that the distribution of p-values across the plot to be equal. Thus, this clustering near absolute values could be indicative of heave rounding of p-values by researchers, which is an indication of p-hacking. Even when p-hacking research tends to focus on values near p = 0.05, this clustering surrounding absolute p-values represent an interesting phenomenon to explore.

P-values frequency by intervals
p-values Intervals Frequency
[0,0.005] 40221
(0.005,0.01] 19462
(0.01,0.015] 7535
(0.015,0.02] 12827
(0.02,0.025] 5048
(0.025,0.03] 10368
(0.03,0.035] 4039
(0.035,0.04] 8600
(0.04,0.045] 3489
(0.045,0.05] 3130

The table above shows p-value intervals. We can see a clustering of p-values the intervals from 0 to 0.005 with 40,221; 0.005 to 0.01 with 19,462; 0.015 to 0.02 with 12827; 0.025 to 0.03 with 10,368; and 0.035 to 0.04 with 8600. It interesting that in the plot and table above there is not a significant clustering of p-values between the interval 0.04 and 0.05.

I continued the exploratory analysis examining the distribution of p-values across different categories. However, because the data set has more than 22 categories it was difficult to create a graph avoiding clustering of values. Thus, I selected the more frequent categories and graphed those.

The graph above shows the most frequent categories as Medical and Health Sciences and Multidisciplinary Studies. Similarly, to the distribution of all p-values, the distribution of p-values indicates that the studies explore a true effect phenomenon, that is reaffirming. Furthermore, there is a similar clustering of around absolute p-values independent of the category. Finally, in this plot by categories, it is possible to see a small clustering of p-values just bellow p = 0.05. Clustering that was not visible in the graph of all p-values. It is interesting that a similar small clustering of p-values is present between all mayor absolute p-values such as 0.01 and 0.02. Perhaps, that is the true distribution of p-values, and the peaks represent p-hacking.

In the context of the public PubMed database, which goes from 1997 to 2014, there has been a significant increase in the number of studies that reported a p-value either in the results section or the abstract. This increase of p-value reporting is particularly strong after 2010. Similarly, it is interesting the sharp decrease of p-value reported after 2013, but this might be related to the number of articles added to the database and not to the number of articles published in PubMed or the number of articles reporting p-values.

One of the aspects explored by David Chavalarias et al., (2016) is the frequency of p-values reported in both the abstracts and the result section. Similarly to the authors’ finding, my exploration finds that most p-values are reported in the body of the articles, in particular in the results section.

Besides the number of p-values reported, it is clear in the plot that the distribution of p-values reported in abstracts does not significantly differ from the distribution of p-values reported in the result section. The larger among of p-values reported in the result section could be related to the fact that researchers have the tendency of reporting only strong p-values in abstracts while weaker ones are reported in the result section only.

Binomial Tests.

Results P-values probability in intervals 0.03 and 0.04
Bins Successes Trials Probabilty
0.03 11977 10337 4.929508e-28
0.04 2397 2755 6.503877e-07

First, I wanted to see the probability of getting a p-value from the result section in the intervals 0 > p <= 0.03 and 0.03 => p <= 0.04. In the case of the result section, the probability of getting a p-value in the interval 0 > p <= 0.03 is 4.92 percent. Similarly the probability of getting a p-value in the interval 0.03 => p <= 0.04 is 6.50 percent. The probability of getting a p-value in the lower bin (0 to 0.03) is significantly lower than the probability of getting a p-value in the bin closer to the threshold p = 0.05. So, I reject the hypothesize that the frequency of p-values in the bins just below the threshold p= 0.05 will be similar and the frequency of p-values will increase as the p-value approaches 0.01. This rejection of the hypothesis is an indication of a moderate p-hacking.

Results P-values probability in intervals 0.03 and 0.04
Bins Successes Trials Probabilty
0.03 1784 1536 1.797859e-05
0.04 348 375 3.335726e-01

In the case of the abstract section, the probability of getting a p-value in the intervals 0 > p <= 0.03 and 0.03 => p <= 0.04 is 1.79 and 3.33 respectively. Thus, similarly, I reject the hypothesis. Concluding that the test shows p-hacking around the threshold of p = 0.05.

Finally, following Head M. et al, (2015), I conducted a binomial test by categories to determine if the p-haling detected in the binomial test across all categories is present in particular categories. The color of the market represent the sample size.

Citations

Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., & Jennions, M. D. (2015). The Extent and Consequences of P-Hacking in Science. PLOS Biology, 13(3), e1002106–15. http://doi.org/10.1371/journal.pbio.1002106