## Summer 2017 Research Projects

Below is a list of faculty who will be conducting research during the summer of 2017 and are looking for research students.

**Visual Text Analysis for the Digital Humanities (Eric Alexander)**

**Finding Consensus Among Tumor Evolutionary Histories (Layla Oesper)**

**Understanding Human Learning Using Machine Learning (Anna Rafferty)**

**Counting Triangular Puzzle Tilings (Jed Yang)**

Descriptions of the projects are below:

**Project: Visual Text Analysis for the Digital Humanities (Eric Alexander)**

Researchers have access to more digital text than ever before, from websites to newspaper articles to books. This availability offers the potential to answer sweeping questions about the evolution of literature and language at scales previously unheard of--so long as we can actually make *sense* of all the data we have. Research in natural language processing has provided us with powerful statistical techniques to model the behavior of text within a large collection of documents. However, using and interpreting such models can present a challenge to those whose expertise lies outside the field of statistics. In my research, I design, develop, and evaluate visual techniques for putting statistical text analysis into the hands of researchers with a wide variety of backgrounds.

This summer, I am looking for 2-3 students to take part in research projects to this end. Potential projects include the following:

**Semantic document chunking:** The act of training a good statistical model on a body of text usually lies somewhere between a black box and an art. In this project, students will design and develop techniques to open up a key part of this black box--namely, that of dividing documents into semantically cohesive parts, or “chunks.” They will explore methods of algorithmically determining semantic shifts within documents, as well as design visual depictions of these shifts so that they can be tuned by domain researchers.
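To give a flavor of what "algorithmically determining semantic shifts" could look like, here is a minimal sketch. It is not the project's method: it simply assumes that a drop in word overlap between adjacent sentences signals a shift. The function names and the `threshold` parameter are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercased word counts for one sentence."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_boundaries(sentences, threshold=0.1):
    """Return each index i where sentences[i] starts a new chunk.

    Assumption: a boundary is declared wherever the similarity between
    a sentence and the one before it falls below `threshold`.
    """
    vectors = [bag_of_words(s) for s in sentences]
    return [
        i
        for i in range(1, len(vectors))
        if cosine(vectors[i - 1], vectors[i]) < threshold
    ]


sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Stock prices fell sharply today.",
    "Investors sold stock quickly.",
]
boundaries = chunk_boundaries(sentences)
```

The `threshold` value is exactly the kind of knob a domain researcher might want to tune visually rather than by trial and error.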

**Character sonic signatures:** Different characters within literature are sometimes given different voices--not just in the types of words they use, but in the *sound* of their speech. For instance, in Shakespeare’s *Othello*, the titular character is sometimes described as having slower, rounder speech when compared to the quick, staccato dialog of the villain Iago. Can this be detected algorithmically? Are some authors better at differentiating the speech of their individual characters than others? Students in this project will investigate ways of classifying text by sound, as well as methods for visually conveying such classifications to a reader.

**Evaluation of visual text summaries:** There are many ways of representing statistical summaries of document collections, from bar charts to word clouds. The efficacy of these techniques for actually *conveying* a summary to a reader is hotly debated. In this project, students will investigate the ability of human participants to retrieve summary information from different visualization types.

The precise trajectory of these projects is open-ended, to be steered by the particular backgrounds and interests of the students involved. Some experience with statistical models and/or machine learning would be helpful, but is not required. In an ideal world, accepted students would enroll in Data Visualization (CS 314) in the spring and potentially take a 1-credit independent study to prepare for their project.

**Project: Finding Consensus Among Tumor Evolutionary Histories (Layla Oesper)**

Cancer is a disease resulting from the accumulation of genomic alterations that occur during an individual’s lifetime and cause the uncontrolled growth of a collection of cells into a tumor. These mutations occur as part of an evolutionary process that may have begun decades before a patient’s diagnosis. A better understanding of the history of a tumor’s evolution over time may yield important insight into how and why tumors develop as well as which mutations drive their growth. While recent algorithmic progress has led to improved inference of tumor evolutionary histories, there is still much room for improvement. For example, a consensus history built from multiple potential tumor evolutionary histories (each constructed by an existing algorithm) has the potential to provide a richer and more accurate description of a tumor’s history.

This summer project will be part of an ongoing initiative to develop methods that, given a collection of evolutionary histories, infer a single consensus tumor evolutionary history. The exact details of what students will be working on will depend on their interests, their background, and how the project progresses prior to the start of summer. Aspects of the project that students are likely to work on include:

(1) Extending existing methods (or developing new ones) to incorporate additional input data derived from DNA sequencing experiments.

(2) Transforming software prototypes into releasable products.

(3) Initial collection and processing of real DNA data from online databases.

I expect to hire multiple students for this project. Students who are accepted will work for 10 weeks during the summer of 2017. Ideally, students should be available to participate in an independent study during the spring of 2017 to read papers, familiarize themselves with related tools/concepts, and have discussions to begin planning the project. Applicants should have completed CS 201 and either CS 202 or MATH 236. Students who have taken Computational Biology, Bioinformatics or Algorithms are also strongly encouraged to apply. No specific biology background is required, just an interest in applying computational techniques to important biological problems.

**Project: Understanding Human Learning Using Machine Learning (Anna Rafferty)**

Learning analytics and machine learning models are increasingly used in educational technologies, facilitating the creation of adaptive systems that dynamically provide students with practice in areas where they struggle. These systems can be engaging and motivating for students, and they offer opportunities to better understand human learning by logging fine-grained information about students' problem-solving choices. In my research, I focus on building algorithms that can make inferences about understanding based on the data collected in educational technologies. These algorithms typically involve machine learning as well as models of human learning. Once I have an algorithm that can make these inferences, I explore questions such as how we can use the inferences to make personalized decisions about feedback and instructions, or how we can automatically improve the algorithm to make better inferences.

In the research this summer, you'll be focusing on one of two general questions:

1) How can we automatically sequence the problems a student solves in an online algebra tutor to gain as much information as possible about what she misunderstands?

2) How can we use data to identify a space of possible strategies or possible misunderstandings that learners might have about a particular domain, such as algebraic equation solving or high school chemistry?

This research is part of a larger project in which we've been developing machine learning techniques for inferring misunderstandings based on students' choices, and we've deployed our algorithm in a free algebra tutor where learners can practice equation solving. Through this project, we've collected a great deal of data on solving algebra equations, which is likely to be the primary type of data you're working with, and we may also touch on data about learners' choices in other educational technologies, such as a sequence of virtual chemistry labs. In addition to addressing one of these research questions, you may spend some time helping out on other parts of the larger project to gain a better understanding of the overall goals. The exact focus of your work will depend on how things evolve over the course of the spring and summer and your own interests and background. This project combines cognitive science, machine learning, and statistics, so it's a great opportunity to see the multidisciplinary nature of computer science!

CS 201 and CS 202, including especially the material on probability, are appropriate preparation for this project. Experience with statistics, AI or data mining, or probability at a more advanced level than CS 202 would be a definite plus, but is not required. Ideally, you would do a 1-credit independent study with me in the spring to start reading relevant background material and learn about the data you'll be working with.

Students will be hired for 6-8 weeks of research; we will discuss the exact number of weeks in the second stage of the application process. This research is compatible with also being an RA for the Summer Computer Science Institute (SCSI), and I strongly encourage you to apply for that position if you're interested, as that would allow you to work for a total of 10 weeks between research and SCSI.

**Project: Counting Triangular Puzzle Tilings (Jed Yang)**

The combinatorial problem of counting the number of ways to tile a region is computationally hard (#P-complete) in general. The counts of certain specific sets of tiles and regions have deep connections in mathematics. For example, tiling triangular regions with a set of 3 "puzzle" tiles gives Littlewood-Richardson coefficients, which are numbers that occur in seemingly unrelated fields of mathematics. In my current research program, I make small modifications to these tiles, which change the counts. If we are lucky (more on this below), these new numbers will satisfy certain well-defined algebraic equations, signifying that they may be secretly counting some features of (possibly unknown) mathematical objects. One eventual goal of my program is to discover new mathematical structures by experimenting with simple tiles. This research has the potential for wide impact in mathematics due to its natural (and mysterious) connections to various fields, such as the study of symmetric functions, Grassmannians, and representation theory.

This summer, I would like to work with students to remove the "luck" component of the above mathematical description. Instead of doing the modifications by hand, I would like to automate this process. Specifically, in Stage 1, you will develop computer software that can count these triangular tilings efficiently. In Stage 2, we will use these programs to automate the exploration of modifications of tiles and test whether the counts obey the aforementioned algebraic equations. Finding many such examples will be a first step in understanding how to connect these tiling results to other fields. On one hand, Stage 1 is feasible since slow algorithms are readily available; on the other hand, it is both challenging and valuable because the faster we can make our algorithms, the more variations we can test in Stage 2. You will gain valuable skills in algorithm design, implementation, and software development. Depending on your interests and background, we may also work on the more mathematical side of tiling theory.
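As a rough illustration of what a "slow but available" counting algorithm looks like, here is a minimal sketch of an exhaustive tiling counter. Note the substitution: it counts domino tilings of a rectangular grid, a much simpler stand-in, not the puzzle tiles on triangular regions that this project studies. The function name and the bitmask representation are assumptions made for this example.

```python
from functools import lru_cache


def count_domino_tilings(rows, cols):
    """Count the ways to cover a rows x cols grid with 1x2 dominoes.

    Cells are numbered in row-major order; bit k of `mask` is set
    when cell k is already covered.
    """
    full = (1 << (rows * cols)) - 1

    @lru_cache(maxsize=None)
    def fill(mask):
        if mask == full:
            return 1  # every cell is covered: one complete tiling
        # Find the first uncovered cell; it must be covered by a
        # domino extending to the right or downward.
        cell = 0
        while (mask >> cell) & 1:
            cell += 1
        r, c = divmod(cell, cols)
        total = 0
        # Place a horizontal domino if the right neighbor is free.
        if c + 1 < cols and not (mask >> (cell + 1)) & 1:
            total += fill(mask | (1 << cell) | (1 << (cell + 1)))
        # Place a vertical domino if the cell below is free.
        if r + 1 < rows and not (mask >> (cell + cols)) & 1:
            total += fill(mask | (1 << cell) | (1 << (cell + cols)))
        return total

    return fill(0)


small = count_domino_tilings(2, 3)
square = count_domino_tilings(4, 4)
```

Even in this toy setting the number of reachable states grows quickly with the size of the region, which is why faster counting methods translate directly into more variations tested in Stage 2.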

I plan on hiring 2 students for 5-10 weeks of research. When explaining why you want to work on this project, please note whether you are interested in a 5-week position, a 10-week position, or both.

CS 201 or its equivalent is appropriate preparation for this project. In particular, CS 202 (or other math background) is not required. However, students with strong interests in mathematics are encouraged to apply. Ideally, students should be available to complete a 1-credit independent study during the spring to read papers, familiarize themselves with the background, and plan for the summer.