CSE 477: Processing of Strings and Sequences

Course Information

Instructor

Dr. Jaroslaw Zola

For all email communication, please make sure to add prefix [PSS] to mail subject.

Course Description

This course is intended for students interested in learning efficient techniques for processing and analyzing large text collections, such as large-scale system logs, massive text corpora or databases of DNA sequences. The main focus is on classic algorithms and data structures for strings and sequences, including pattern matching, pairwise comparison, indexing and searching, as well as probabilistic methods, like fingerprinting and hashing. The theoretical component is complemented by practical considerations regarding efficient implementations of the discussed algorithms, and their applications in the real-world systems. The example applications include tools like UNIX grep, frameworks for plagiarism detection, as well as tools driving computational biology (e.g., BLAST, DNA assemblers, etc.). The course has also a programming component, in which students implement in their language of choice small but fully functional text processing applications.

This course is the Theory/Algorithms focus area course at CSE.

Course Organization

The course consists of a series of lectures covering multiple algorithms on strings and sequences, including their design, analysis and real-world applications. Lectures are complemented with a programming assignments exposing practical aspects of the covered material.

Below is a list of topics covered in the course:

  1. Exact pattern matching: Knuth-Morris-Pratt, Boyer-Moore, Z-box, Aho-Corasick and semi-numerical algorithms.
  2. Radix and Suffix Trees: construction, querying, applications.
  3. Suffix Arrays and LCP arrays: construction, querying, applications.
  4. Burrows–Wheeler Transform and FM-Index: construction, querying, applications.
  5. Winnowing, fingerprinting and locality sensitive hashing for text processing.
  6. Inexact matching and pairwise sequence comparison: Levenshtein distance, Smith-Waterman and Needleman-Wunsch algorithms.
  7. Succinct data structures: bit vector, wavelet trees (time permitting).

Course Prerequisites

The course is intended for computer science majors, computer engineering majors, or requires permission of the department. The course requires some good experience in basic probability and synthesis and analysis of algorithms at the level of “CSE250: Data Structures and Their Algorithms.” The course has a programming component, hence the ability to write working and correct code is must have. Note that the course is programming language oblivious.

Program Outcomes

Upon completion of this course you will:

The course adopts the Student Outcomes from the Engineering Accreditation Commission of ABET (see here).

Course Learning Outcome Program Outcomes/Competencies Instructional Method(s) Assessment Method(s)
Gain basic skills required to design and analyze algorithms on strings and sequences. 1 Analyze a complex computing problem and to apply principles of computing and other relevant disciplines to identify solutions. Lectures, Assignments Exams, Assignments
Be able to select and apply proper algorithms to process large text corpora, including large system logs, text documents, biomedical records, and DNA databases. 2 Design, implement, and evaluate a computing-based solution to meet a given set of computing requirements in the context of the program’s discipline. Lectures, Assignments Exams, Assignments
Be able to implement string processing algorithms at various levels of complexity (e.g., with and without indexing). 6 Apply computer science theory and software development fundamentals to produce computing-based solutions. Lectures, Assignments Assignments

Program Outcome support

Program Outcome 1 2 3 4 5 6
Support Level 3 3 3

Not Supported, 1: Minimally Supported, 2: Supported, 3: Strongly Supported

Course Requirements

The course has three requirements:

  1. Midterm exam testing your understanding of the most basic string algorithms and the ability to reason about their performance and applicability.
  2. Programming assignments exposing you to the practical aspects of the covered material. Each assignment will be a mini-project implementing a small text processing application (e.g., grep, etc.).
  3. Final exam testing your overall understanding of the material.

Grading Policy

The final grade will be weighted average: 20% midterm exam, 30% final exam, 50% programming assignments. The number-to-letter grade mapping will be done as indicated in the table below.

Score Grade Points
95-100 A 4.0
90-94 A- 3.67
80-89 B+ 3.33
70-79 B 3.0
60-69 B- 2.67
55-59 C+ 2.33
50-54 C 2.00
45-49 C- 1.67
40-44 D 1
0-39 F 0.0

In general, no incomplete grades (“IU” or “I”) will be given. However, in special circumstances that are truly beyond your control and justify incomplete grade, we will follow the university policy on incomplete grades, available here.

Course Materials

This course does not have a required textbook. However, the following book is highly recommended, as the course is roughly based on its content:

  1. D. Gusfield, “Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology,” Cambridge University Press, 1997.

Additional readings (e.g., papers, tutorials, etc.) will be referenced thoughtout the course as needed.

Academic Integrity

Academic integrity is a fundamental university value. Through the honest completion of academic work, students sustain the integrity of the university and of themselves while facilitating the university’s imperative for the transmission of knowledge and culture based upon the generation of new and innovative ideas.

You must be familiar with the university and departmental policies on academic integrity!!! The university policies for are available here. The CSE policies are available from the CSE web page.

Any violation of these policies, including but not limited to cheating on any course deliverable (e.g., homework project, exam, etc.), will result in automatic failure of the course. There will be no leniency!

Thank you for upholding your own personal integrity and ensuring UB’s tradition of academic excellence!

Use of Generative AI Models

The current generative AI large language models, such as, e.g., ChatGPT, LLaMA, etc. perform extremely poorly when challenged with the content of this course. In fact, virtually all prompts to the current models lead to incorrect answers, hallucinated answers or irrelevant answers. Because of that, the use of generative AI is not accepted or encouraged in this course. Any use of AI to cheat in the course will be considered a violation of the academic integrity rules.

Accessibility Resources

If you have any disability which requires reasonable accommodations to enable you to participate in this course, please contact the Office of Accessibility Resources in 60 Capen Hall, 716-645-2608 and also the instructor of this course during the first week of class. The office will provide you with information and review appropriate arrangements for reasonable accommodations, which can be found on the web at: http://www.buffalo.edu/studentlife/who-we-are/departments/accessibility.html.

University Support Services

As a student you may experience a range of issues that can cause barriers to learning or reduce your ability to participate in daily activities. These might include strained relationships, anxiety, high levels of stress, alcohol/drug problems, feeling down, health concerns, or unwanted sexual experiences. Counseling, Health Services and Health Promotion are here to help with these or other issues you may experience. You learn can more about these programs and services by contacting:

Counseling Services 120 Richmond Quad (North Campus), 716-645-2720 202 Michael Hall (South Campus), 716-829-5800 https://www.buffalo.edu/studentlife/who-we-are/departments/counseling.html

Health Services Michael Hall (South Campus), 716-829-3316 https://www.buffalo.edu/studentlife/who-we-are/departments/health.html

Office of Health Promotion 114 Student Union (North Campus), 716-645-2837 https://www.buffalo.edu/studentlife/who-we-are/departments/health-promotion.html.

Sexual Violence

UB is committed to providing a safe learning environment free of all forms of discrimination and sexual harassment, including sexual assault, domestic and dating violence and stalking. If you have experienced gender-based violence (intimate partner violence, attempted or completed sexual assault, harassment, coercion, stalking, etc.), UB has resources to help. This includes academic accommodations, health and counseling services, housing accommodations, helping with legal protective orders, and assistance with reporting the incident to police or other UB officials if you so choose. Please contact UB’s Title IX Coordinator at 716-645-2266 for more information. For confidential assistance, you may also contact a Crisis Services Campus Advocate at 716-796-4399.

Please be aware UB faculty are mandated to report violence or harassment on the basis of sex or gender. This means that if you tell me about a situation, I will need to report it to the Office of Equity, Diversity and Inclusion. You will still have options about how the situation will be handled, including whether or not you wish to pursue a formal complaint. Please know that if you do not wish to have UB proceed with an investigation, your request will be honored unless UB’s failure to act does not adequately mitigate the risk of harm to you or other members of the university community. You also have the option of speaking with trained counselors who can maintain complete confidentiality. UB’s Options for Confidentially Disclosing Sexual Violence provides a full explanation of the resources available, as well as contact information. You may call UB’s Office of Equity, Diversity and Inclusion at 716-645-2266 for more information, and you have the option of calling that office anonymously if you would prefer not to disclose your identity.

Copyright 2024 Jaroslaw Zola jzola@buffalo.edu