Reliable and low-cost test collections construction using machine learning




Rahman, Md Mustafizur (Ph. D. in information studies)




The development of new search algorithms requires an evaluation framework in which A/B testing of new vs. existing algorithms can be reliably performed. While today's search evaluation methodology is reliable, it relies heavily upon people manually annotating the relevance of many search results, which is slow and expensive. Moreover, this practice has become increasingly infeasible as digital collections have grown ever larger. Consequently, there is an urgent need for information retrieval (IR) evaluation methods that are both cost-effective and reliable. My doctoral research focuses on developing low-cost yet reliable IR evaluation methods by integrating state-of-the-art machine learning (ML) techniques with traditional human annotation. More specifically, in this dissertation, I focus on improving system-based IR evaluation methods that rely on constructing test collections. I present my work in four directions: i) understanding the effects of the participating systems on the qualities of a test collection, ii) designing a machine learning system to reduce the human annotation effort for a given search topic, iii) allocating annotation budget across search topics via a dynamic feedback loop between a reinforcement learning method and an active learning algorithm, and iv) developing a dataset for hate speech by adapting methods for constructing test collections in IR. In the first direction, I investigate how the number of participating systems impacts the qualities of a test collection. I then propose a robust prediction model that can predict the qualities of a test collection even before relevance judgments are collected. In the second direction, I seek to reduce the human annotation effort needed to evaluate IR systems by using active learning. Specifically, rather than relying entirely on human annotators to judge search results, I propose a combination of human annotation and machine intelligence.
In the third direction, I aim to predict how human judging effort can be intelligently allocated across different search topics. Whereas traditional approaches allocate the same human judging effort to every search topic, I use reinforcement learning which, in combination with the active learning algorithm, allows the budget to be allocated dynamically for each search topic. Finally, I develop a dataset for hate speech by adapting ideas from test collection construction in IR. This dataset has broader coverage of hate speech than prior hate speech datasets.


