The Vaccination Corpus is released by the Computational Lexicology and Terminology Lab (CLTL) and the Web & Media Group of the Vrije Universiteit Amsterdam to facilitate research on the online vaccination debate.
The corpus contains a variety of web documents (including news, blogs, editorials, governmental reports, and science articles) on the vaccination debate. They represent supporting, opposing, and neutral views on vaccination.
The documents were collected from the Web both manually and automatically. We have created a framework in which newly crawled web documents can easily be added. To ensure future accessibility of the webpages, we use their archived versions in the Internet Archive (http://archive.org). When adding a new document to the dataset, we find its most recent snapshot in the Archive and retrieve the metadata and text from this snapshot.
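The snapshot lookup described above could be sketched roughly as follows. This is not the corpus framework's own code, only a minimal illustration that assumes the Internet Archive's public Wayback availability endpoint (https://archive.org/wayback/available), which, when queried without a timestamp, is documented to return the most recent capture. The function names and the shape of the returned dictionary are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public Wayback availability endpoint is used for the lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(page_url: str) -> str:
    """Build the query URL asking the Archive for a snapshot of page_url."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': page_url})}"


def parse_closest_snapshot(payload: dict) -> Optional[dict]:
    """Extract the snapshot URL and timestamp from an availability response.

    Returns None when the Archive reports no available snapshot.
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return {"url": closest["url"], "timestamp": closest["timestamp"]}


def find_latest_snapshot(page_url: str, timeout: float = 10.0) -> Optional[dict]:
    """Query the Archive over the network (hypothetical helper, not the corpus code)."""
    with urlopen(build_availability_url(page_url), timeout=timeout) as response:
        return parse_closest_snapshot(json.load(response))
```

Calling `find_latest_snapshot("http://example.com")` would return either a dictionary with the archived URL and its timestamp, or `None` if no snapshot exists. The metadata and text would then be fetched from the archived URL rather than from the live page.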