Background

  • Preprints are complete, but unpublished, manuscripts that have not gone through the peer-review process.¹ Preprint servers, such as medRxiv and bioRxiv, enable researchers to disseminate findings and to receive feedback before submitting their manuscript to a peer-reviewed journal.
    • bioRxiv, introduced in 2013, is an established free online archive and distribution service for unpublished manuscripts (preprints) in the life sciences.
    • medRxiv, introduced in 2019, is a free online archive and distribution server for unpublished manuscripts (preprints) in the medical, clinical and related health sciences.
  • The use of the medRxiv preprint database by pharmaceutical companies has been previously described.² However, it remains unclear how these usage trends compare with those of the established bioRxiv database.
Objective
  • To understand and compare the use of medRxiv versus bioRxiv preprint databases by pharmaceutical companies.

Methods

  • Key data on preprints from June 25, 2019 to January 04, 2021, and from January 01, 2013 to January 04, 2021 were extracted from the medRxiv and bioRxiv databases, respectively, using the medrxivr R package for both databases. Preprint data from bioRxiv from June 25, 2019 to January 04, 2021 were also extracted and reported.
  • Pharma-affiliated preprints were defined as having at least one author affiliated with a top 50 pharmaceutical company;³ these preprints were further classified by the byline position of the pharma-affiliated author (first, last, or middle author).
  • Authors’ affiliations, research topics, number of versions, publication status, time to publication, and copyright licence information were extracted for all preprints and for pharma-affiliated author preprints only.
    • Authors’ affiliations were extracted using the rvest web-scraping package. To quantify the numbers of pharma-affiliated preprints, Wilcoxon signed-rank tests were conducted in the R software environment using the ‘wilcox.test’ function for paired data.
    • To investigate the potential influence of the COVID-19 pandemic on preprint research topics, an additional search was conducted to quantify the number of COVID-19-related preprints that included any of the following key words in the abstract text: covid, coronavirus, SARS-CoV2, nCov, Covid and Coronavirus.
    • The number of versions (i.e., number of revisions) was extracted as medRxiv metadata.
    • The time to publication was analysed using additional data accessed via the CrossRef application programming interface (API).
  • Additionally, Altmetric data and tweet counts were assessed for preprints and full publications; these data were accessed via the Altmetric API using the rAltmetric library.
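
The byline classification described above can be sketched as follows. This is a minimal illustration in Python, not the R code used in the study: the three-company list, the whole-word matching rule, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical subset for illustration; the study used a top 50 company list.
PHARMA_COMPANIES = ("pfizer", "novartis", "roche")

# Whole-word, case-insensitive match, so "Rochester" does not match "roche".
_PHARMA_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in PHARMA_COMPANIES) + r")\b",
    re.IGNORECASE,
)


def is_pharma(affiliation):
    """Return True if the affiliation string names a listed company."""
    return _PHARMA_PATTERN.search(affiliation) is not None


def classify_byline(affiliations):
    """Classify a preprint from its ordered list of author affiliations."""
    positions = set()
    last = len(affiliations) - 1
    for index, affiliation in enumerate(affiliations):
        if not is_pharma(affiliation):
            continue
        if index == 0:
            positions.add("first")
        if index == last:
            positions.add("last")
        if 0 < index < last:
            positions.add("middle")
    return {
        "pharma_affiliated": bool(positions),
        "first_or_last": bool(positions & {"first", "last"}),
        "positions": sorted(positions),
    }
```

A sole author counts as both first and last in this sketch; how the study treated single-author preprints is not stated.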

Results

Number of preprints
Most medRxiv and bioRxiv preprints were not pharma-affiliated research.


medRxiv preprints

0.8% (n = 127/15,018) of all preprints had at least one author affiliated with a top 50 pharmaceutical company.

0.1% (n = 21/15,018) had a pharmaceutical company employee as first or last author.

bioRxiv preprints

Posted during the period 2019–2021


0.9% (n = 508/58,106) of all preprints had at least one author affiliated with a top 50 pharmaceutical company.

0.2% (n = 107/58,106) had a pharmaceutical company employee as first or last author.

Posted during the period 2013–2021


0.8% (n = 894/107,489) of all preprints had at least one author affiliated with a top 50 pharmaceutical company.

0.2% (n = 193/107,489) had a pharmaceutical company employee as first or last author.
Figure 1. medRxiv (2019–2021)
Figure 2. bioRxiv (2019–2021)
Figure 3. bioRxiv (2013–2021)
medRxiv was launched mid-2019; most preprints were deposited during the 2020 calendar year, coinciding with the release of a large volume of information relating to the COVID-19 pandemic. Data labels without corresponding data bars represent "0" values.
Topics
The most common topics were "infectious diseases" and "epidemiology" for medRxiv preprints and "neuroscience" for bioRxiv preprints.


medRxiv preprints

25.7% (n = 3,864/15,018) and 21.9% (n = 3,291/15,018) of all preprints, and 33.9% (n = 43/127) and 11.0% (n = 14/127) of pharma-affiliated preprints, were on the topics "infectious diseases" and "epidemiology", respectively.

24.1% (n = 3,616/15,018) of all preprints and 23.6% (n = 30/127) of pharma-affiliated preprints were related to COVID-19.

bioRxiv preprints

Posted during the period 2019–2021


18.5% (n = 10,758/58,106) of all preprints and 13.8% (n = 70/508) of pharma-affiliated preprints were on the topic "neuroscience".

2.7% (n = 1,563/58,106) of all preprints and 3.9% (n = 20/508) of pharma-affiliated preprints were related to COVID-19.

Posted during the period 2013–2021


17.3% (n = 18,625/107,489) of all preprints and 13.4% (n = 120/894) of pharma-affiliated preprints were on the topic "neuroscience".

1.5% (n = 1,623/107,489) of all preprints and 2.2% (n = 20/894) of pharma-affiliated preprints were related to COVID-19.
Figure 4. medRxiv (2019–2021)
Figure 5. bioRxiv (2019–2021)
Figure 6. bioRxiv (2013–2021)
Data labels without corresponding data bars represent "0" values. AIDS, acquired immunodeficiency syndrome; HIV, human immunodeficiency virus.
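
The COVID-19 keyword search described in the Methods can be sketched as below. This Python example is illustrative only: it assumes case-insensitive substring matching on the abstract text, whereas the study listed both lower-case and capitalised key words, and the function names are hypothetical.

```python
# Key words from the Methods, lower-cased; case-insensitive matching is an
# assumption of this sketch. Note that "sars-cov2" is kept as written in the
# source, so the hyphenated spelling "SARS-CoV-2" is not matched by that term.
COVID_KEYWORDS = ("covid", "coronavirus", "sars-cov2", "ncov")


def is_covid_related(abstract):
    """Return True if the abstract contains any of the key words."""
    text = abstract.lower()
    return any(keyword in text for keyword in COVID_KEYWORDS)


def covid_share(abstracts):
    """Return (count, percentage) of abstracts flagged as COVID-19-related."""
    flagged = sum(is_covid_related(abstract) for abstract in abstracts)
    percentage = 100.0 * flagged / len(abstracts) if abstracts else 0.0
    return flagged, percentage
```

Applied to the medRxiv counts reported above, 3,616 flagged abstracts out of 15,018 gives 24.1%.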
Number of versions
Most preprints had only one version.

medRxiv preprints

77.4% (n = 11,619/15,018) of all preprints had only one unrevised version posted.

79.5% (n = 101/127) of all pharma-affiliated preprints had only one version, and no pharma-affiliated preprints had more than four versions.

bioRxiv preprints

Posted during the period 2019–2021

73.6% (n = 42,756/58,106) of all preprints had only one unrevised version posted.

75.0% (n = 381/508) of all pharma-affiliated preprints had only one version, and no pharma-affiliated preprints had more than four versions.

Posted during the period 2013–2021

73.5% (n = 78,973/107,489) of all preprints had only one unrevised version posted.

73.7% (n = 659/894) of all pharma-affiliated preprints had only one version, and no preprints had more than five versions.
Figure 7. medRxiv (2019–2021)
Figure 8. bioRxiv (2019–2021)
Figure 9. bioRxiv (2013–2021)
Data labels without corresponding data bars represent "0" values.
Time from latest preprint date to publication
Most preprints remained unpublished at the time of this study; the time from first preprint registration to publication was shorter for medRxiv than bioRxiv preprints.

medRxiv preprints

19.8% (n = 2,971/15,018) of all preprints were published after a median of 103 days (interquartile range [IQR]: 55–158 days; min–max: 0–521 days).

13.4% (n = 17/127) of pharma-affiliated preprints were published after a median of 112 days (IQR: 74–154 days; min–max: 27–289 days).

bioRxiv preprints

Posted during the period 2019–2021

30.8% (n = 17,894/58,106) of all preprints were published after a median of 146 days (IQR: 77–198 days; min–max: 0–632 days).

28.0% (n = 142/508) of pharma-affiliated preprints were published after a median of 148 days (IQR: 85–213 days; min–max: 6–527 days).

Posted during the period 2013–2021

45.0% (n = 48,374/107,489) of all preprints were published after a median of 150 days (IQR: 86–233 days; min–max: 0–1,781 days).

44.6% (n = 399/894) of pharma-affiliated preprints were published after a median of 161 days (IQR: 100–255 days; min–max: 0–789 days).
Figure 10. medRxiv (2019–2021)
Figure 11. bioRxiv (2019–2021)
Figure 12. bioRxiv (2013–2021)
Figure 13. medRxiv (2019–2021)
Figure 14. bioRxiv (2019–2021)
Figure 15. bioRxiv (2013–2021)
Upper panels: the cut-off date for analysis was January 04, 2021. For pharma-affiliated preprints, the mean time to publication from first preprint registration was less for medRxiv (123.8 days) preprints than for bioRxiv preprints (188.0 days). The high proportion of unpublished preprints might be due to the high number of preprints posted after May 2020, which might still have been going through the publication process at the time the analysis was performed. Data labels without corresponding data bars represent "0" values.

Lower panels: the time to full publication for all preprints is shown as individual density plots, ordered by median, for journals with at least 10 articles. The remaining articles (medRxiv: 2,399 articles in 1,020 journals; bioRxiv (2019–2021): 13,046 articles in 1,813 journals; bioRxiv (2013–2021): 34,191 articles in 2,847 journals) are grouped under "Other". Vertical lines indicate median values. IJERPH, International Journal of Environmental Research and Public Health; IJID, International Journal of Infectious Diseases.
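
The time-to-publication summaries reported in this section (median, IQR, min–max) can be computed as in the following Python sketch. The dates are invented for illustration, the function names are hypothetical, and the quartile method ("inclusive") is an assumption, as the poster does not state which quantile definition was used.

```python
from datetime import date
from statistics import median, quantiles


def days_to_publication(posted, published):
    """Number of days between preprint posting and journal publication."""
    return (published - posted).days


def summarise(days):
    """Return median, IQR bounds and range for a list of day counts."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(days, n=4, method="inclusive")
    return {
        "median": median(days),
        "iqr": (q1, q3),
        "min": min(days),
        "max": max(days),
    }


# Hypothetical (posted, published) date pairs, not study data.
pairs = [
    (date(2020, 3, 1), date(2020, 3, 11)),
    (date(2020, 3, 1), date(2020, 3, 21)),
    (date(2020, 3, 1), date(2020, 3, 31)),
    (date(2020, 3, 1), date(2020, 4, 10)),
    (date(2020, 3, 1), date(2020, 4, 20)),
]
days = [days_to_publication(posted, published) for posted, published in pairs]
summary = summarise(days)  # median 30, IQR (20.0, 40.0), min 10, max 50
```

Only preprints with a recorded publication date enter such a summary, which is why the percentages of published preprints are reported alongside the medians.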
Copyright licence
The most common licence for all preprints, regardless of repository, was CC-BY-NC-ND.
medRxiv preprints

42.2% (n = 6,341/15,018) of all preprints had a CC-BY-NC-ND licence.

36.2% (n = 46/127) of all pharma-affiliated preprints had a CC-BY-NC-ND licence.

bioRxiv preprints

Posted during the period 2019–2021

35.8% (n = 20,794/58,106) of all preprints had a CC-BY-NC-ND licence.

36.8% (n = 187/508) of all pharma-affiliated preprints had no associated Creative Commons licence (i.e., ‘CC-NO’).

Posted during the period 2013–2021

35.0% (n = 37,619/107,489) of all preprints had a CC-BY-NC-ND licence.

38.0% (n = 340/894) of all pharma-affiliated preprints had no associated Creative Commons licence (i.e., ‘CC-NO’).
Figure 16. medRxiv (2019–2021)
Figure 17. bioRxiv (2019–2021)
Figure 18. bioRxiv (2013–2021)
CC-BY, Creative Commons Attribution licence; CC-BY-NC, Creative Commons Attribution-NonCommercial licence; CC-BY-NC-ND, Creative Commons Attribution-NonCommercial-NonDerivative licence; CC-BY-ND, Creative Commons Attribution-NonDerivative licence; CC0, no rights reserved; CC-NO, the copyright holder for this preprint is the author/funder, who has granted medRxiv a licence to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission; CC0-NG, the copyright holder has placed this preprint in the public domain. It is no longer restricted by copyright. Anyone can legally share, reuse, remix or adapt this material for any purpose without crediting the original authors.
Reach and impact
Reach and impact were higher for subsequent peer-reviewed publications than for medRxiv and bioRxiv preprints.
Figure 19. medRxiv (2019–2021)
Figure 20. bioRxiv (2019–2021)
Figure 21. bioRxiv (2013–2021)
In total, 6,330 medRxiv preprints, 37,224 bioRxiv (2019–2021) preprints and 99,846 bioRxiv (2013–2021) preprints and their corresponding publications were included in the analysis. p values are from Wilcoxon signed-rank tests on medians (these consider each preprint–peer-reviewed publication pair). For an individual preprint or article, most of the altmetric attention score would be expected to accrue within the first few weeks of availability (after upload to medRxiv and publication, respectively). However, citations usually do not start to appear until 6–12 months afterwards, meaning that citation numbers for preprints and peer-reviewed publications are likely to increase over time. Nevertheless, citation numbers for peer-reviewed publications were significantly greater than those for the corresponding preprints.
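
The paired comparison described above can be illustrated with the Python sketch below. It is not the R ‘wilcox.test’ call used in the study: it uses a normal approximation with a tie correction and no continuity correction, so p values may differ slightly from those produced by R. The function name is hypothetical and the citation counts are invented.

```python
import math


def wilcoxon_signed_rank(preprint, publication):
    """Two-sided paired Wilcoxon signed-rank test (normal approximation).

    Differences are publication minus preprint; zero differences are dropped.
    Returns (V, z, p), where V is the sum of ranks of positive differences.
    """
    diffs = [b - a for a, b in zip(preprint, publication) if b != a]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    v = sum(rank for rank, diff in zip(ranks, diffs) if diff > 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (v - mean) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return v, z, p


# Hypothetical citation counts for five preprint-publication pairs.
preprint_citations = [1, 2, 3, 4, 5]
publication_citations = [3, 3, 7, 9, 5]
v, z, p = wilcoxon_signed_rank(preprint_citations, publication_citations)
```

In this toy example every non-zero difference favours the publication, so V equals the maximum rank sum for four pairs (10) and z is positive.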

Study limitations

  • The medRxiv database was launched in June 2019, whereas the bioRxiv database was launched in November 2013; therefore, there were substantially more bioRxiv preprints included in the overall analysis. To address this limitation, and to accurately compare the medRxiv and bioRxiv preprints data, we also analysed bioRxiv preprints that were posted during the period 2019–2021.
  • In this analysis, preprints that were funded by pharmaceutical companies (big or small), and did not contain a pharma-affiliated author, were not evaluated; therefore, the numbers of pharma-affiliated preprints identified may be underestimated.
  • Regarding copyright licences, only data on the type of licence were collected; therefore, it was not possible to determine whether the selected copyright licence was an author’s choice or a journal requirement.

Conclusions

  • Only a small proportion of medRxiv and bioRxiv preprints were affiliated with pharmaceutical companies.
  • The proportions of pharma-affiliated medRxiv and bioRxiv preprints with revisions were similar.
  • Between June 2019 and January 2021, less than 20% of all medRxiv preprints and less than 40% of all bioRxiv preprints were subsequently published in peer-reviewed journals.
    • Many factors that we were unable to define could contribute to this finding, such as long submission-to-publication lead times for peer-reviewed journals, or preprints never being submitted to a peer-reviewed journal.
  • Of those preprints that were subsequently published at the time of this analysis, the time from first registration to full publication was shorter for medRxiv preprints than for bioRxiv preprints.
  • Peer-reviewed publications maintain higher reach and impact than preprints alone, supporting the value and importance of the peer-review process.