

We describe a competitive question generation and answering project used in our undergraduate natural language processing courses. This semester-long project challenges teams of three or four students to use available NLP components (or develop their own) to construct systems that ask and answer questions about an arbitrary Wikipedia article. We describe how the project and competition were structured, the outcomes, and lessons learned. The Question/Answer dataset was generated by students who took undergraduate natural language processing courses taught by Noah Smith at Carnegie Mellon and Rebecca Hwa at the University of Pittsburgh during Spring 2008, Spring 2009, and Spring 2010.

Data Collection

The project proceeded in four phases over a 15-week semester: data preparation (weeks 1–4), during which the first few course lectures introduced the most important concepts for getting started in NLP and motivating applications; system development (weeks 5–12), during which teams worked on their systems as they learned more about problems and solutions in NLP; evaluation/competition (weeks 13–14); and live demonstrations (hosted by the local Google office) at the end. The first and third phases are most relevant.

Data Annotation

There are three directories, one for each year of students: S08, S09, and S10.

The file "question_answer_pairs.txt" contains the questions and answers. The first line of the file contains column names for the tab-separated data fields in the file. The six fields are:

Field 1 is the name of the Wikipedia article from which questions and answers initially came.
Field 2 is the question.
Field 3 is the answer.
Field 4 is the prescribed difficulty rating for the question as given to the question-writer.
Field 5 is a difficulty rating assigned by the individual who evaluated and answered the question, which may differ from the difficulty in field 4.
Field 6 is the relative path to the prefix of the article files. HTML files (.htm) and cleaned text files (.txt) are provided.
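
The field layout above can be read with a short Python sketch. This is a minimal illustration, not code shipped with the dataset: the attribute names on `QAPair`, the helper names, and the sample row are all invented for the example, and the column names in the real header line are not assumed here (fields are taken by position).

```python
import csv
import io
from typing import NamedTuple


class QAPair(NamedTuple):
    # Attribute names are hypothetical labels for fields 1-6 described above.
    article_title: str
    question: str
    answer: str
    difficulty_from_questioner: str
    difficulty_from_answerer: str
    article_prefix: str


def read_qa_pairs(stream):
    """Parse tab-separated rows from a text stream, skipping the header line."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # the first line holds the column names
    pairs = []
    for row in reader:
        if len(row) != 6:
            continue  # skip malformed lines rather than guess at their layout
        pairs.append(QAPair(*row))
    return pairs


def article_paths(pair):
    """Build the cleaned-text and HTML paths from the field 6 prefix."""
    return pair.article_prefix + ".txt", pair.article_prefix + ".htm"


# Invented sample data, used only to show the call pattern.
sample = (
    "c1\tc2\tc3\tc4\tc5\tc6\n"
    "Example_Article\tIs this a question?\tyes\teasy\teasy\tdata/set1/a1\n"
)
pairs = read_qa_pairs(io.StringIO(sample))
print(pairs[0].question, article_paths(pairs[0]))
```

For a real file, pass an open file object instead of the `io.StringIO` sample; the character encoding to use when opening it is left to the caller, since it is not stated here.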

Questions that were judged to be poor were discarded from this data set.
The same question frequently appears on multiple lines; this happens when the question was answered by multiple individuals.

This particular release was prepared by Kevin Gimpel, but the data collection process was performed by Noah Smith, Mike Heilman, Rebecca Hwa, Shay Cohen, and many CMU and Pitt students.
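
Because a question answered by several people appears on several lines, it can be convenient to collect all answers under one key. The sketch below assumes rows have already been reduced to (article, question, answer) tuples; the function name and the sample rows are made up for illustration.

```python
from collections import defaultdict


def group_answers(rows):
    """Map each (article, question) pair to the list of answers given for it."""
    grouped = defaultdict(list)
    for article, question, answer in rows:
        grouped[(article, question)].append(answer)
    return dict(grouped)


# Invented rows: the first question was answered by two different people.
rows = [
    ("Example_Article", "Is this a question?", "yes"),
    ("Example_Article", "Is this a question?", "Yes."),
    ("Example_Article", "Is this an answer?", "no"),
]
answers_by_question = group_answers(rows)
print(len(answers_by_question))
```

Keying on the article as well as the question text avoids merging identical questions asked about different articles.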


The project requirements provided considerable flexibility. Students could develop their systems in any programming language, and they were allowed to use existing NLP components available on the Web. The command-line interface for the question generation program was:

./ask art.txt N

where art.txt is a file containing the text of a Wikipedia article, and N is a positive integer telling how many questions to generate. The program is expected to print to standard output a sequence of N newline-separated questions about the article that a human could answer, given the article. Students were instructed to aim for questions that are "fluent and reasonable." The answering program has a similar interface:
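
To make the `ask` contract concrete, here is a minimal Python sketch of a program with that interface. The question generator is a deliberate placeholder (it wraps the first N sentences in a yes/no template) and does not represent any student system; the function names and the naive sentence splitter are assumptions made for the example.

```python
import re
import sys


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def generate_questions(text, n):
    """Placeholder generator: turn the first n sentences into yes/no questions."""
    questions = []
    for sentence in split_sentences(text)[:n]:
        body = sentence.rstrip(".!?")
        questions.append(f"Is it true that {body}?")
    return questions


def main(argv):
    """Entry point for `./ask art.txt N`; returns a process exit code."""
    if len(argv) != 3:
        print("usage: ./ask art.txt N", file=sys.stderr)
        return 2
    with open(argv[1], encoding="utf-8") as f:
        text = f.read()
    for question in generate_questions(text, int(argv[2])):
        print(question)  # one question per line on standard output
    return 0
```

A script file would finish by calling `sys.exit(main(sys.argv))`. Note that this placeholder prints fewer than N questions when the article has fewer than N sentences, which a real submission would need to handle.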

./answer art.txt q.txt

where q.txt lists questions in the same format as ask’s output. Answers are to be written to standard output, one per line. Students were instructed to aim for answers that are fluent, correct, and intelligent. Note that there is no document retrieval component to this project; questions and answers always pertain to a specific, known document.
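
The `answer` interface can be sketched the same way. The example below uses a naive word-overlap baseline (return the article sentence that shares the most words with the question) purely to illustrate the input and output format; it is an assumption for the example, not the method used by any team, and all function names are hypothetical.

```python
import re
import sys


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word types."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def answer_questions(article_text, questions):
    """For each question, pick the article sentence with the largest word overlap."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", article_text.strip()) if s]
    answers = []
    for question in questions:
        q_tokens = tokenize(question)
        if not sentences:
            answers.append("")  # nothing to answer from
            continue
        answers.append(max(sentences, key=lambda s: len(tokenize(s) & q_tokens)))
    return answers


def main(argv):
    """Entry point for `./answer art.txt q.txt`; returns a process exit code."""
    if len(argv) != 3:
        print("usage: ./answer art.txt q.txt", file=sys.stderr)
        return 2
    with open(argv[1], encoding="utf-8") as f:
        article_text = f.read()
    with open(argv[2], encoding="utf-8") as f:
        questions = [line.strip() for line in f if line.strip()]
    for answer in answer_questions(article_text, questions):
        print(answer)  # one answer per line, in question order
    return 0
```

Since no document retrieval is involved, the only inputs are the single article and the question file, and the output keeps one answer per question in the same order.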


Noah A. Smith, Michael Heilman, and Rebecca Hwa. Question Generation as a Competitive Undergraduate Course Project. In Proceedings of the NSF Workshop on the Question Generation Shared Task and Evaluation Challenge, Arlington, VA, September 2008.
Basic Information

License: CC BY-SA 3.0
Updated on: 2022-02-10 07:35:42
Copyright Owner: Carnegie Mellon University