
Overview

Over the course of completing Assignment #5, you will conduct a small project on a topic that is useful or meaningful for you. You will set your own question or objective, and you will adapt and build upon skills and concepts learned in class to solve new problems. Completion of this project involves synthesizing concepts and adapting skills you have learned throughout the semester.

Please see file #1 for candidate project ideas. Where relevant, part of your assignment may involve adapting class example scripts (or concepts), but it is also important to build onto that foundation by using a new package, analysis method, or visualization type. It is also important to implement good programming skills (such as creating functions and avoiding repetitive code). You are also welcome to propose your own topic, but you must speak with me for project approval at least three weeks before the assignment due date. Before making your project selection, I encourage you to read over all of the project ideas carefully, and also check out the provided links.

It is useful to be aware of this array of resources. To ensure your success and minimize stress, start on this project right away! This project is meant to be completed over 3-4 weeks of part-time but regular effort. Assignment #5 serves as both a learning opportunity and an assessment tool. This assignment also provides training, serving as a valuable stepping stone towards completing larger projects (e.g. BINF*6999 or your thesis and beyond). Completing this project contributes towards achieving all five of our course-level learning outcomes:
1. obtain data from key databases relevant for bioinformatics and to understand the sources and limitations of these data
2. filter, manipulate, analyze, and visualize bioinformatics data
3. conduct reproducible analyses and use software tools for version control and collaboration
4. understand and apply selected algorithms commonly used in bioinformatics, including for sequence alignment and clustering
5. adapt the above skills to learn new tools and conduct new analyses not explicitly covered in class

General Guidelines

For most class members, your project should be written entirely in R. It is required that either all or the majority of your assignment be written in R. If suitable for your project (e.g. see file 1 with candidate project topics), you may use a software tool outside of R as part of your work. It is essential that you explain clearly what you did in the other tool. Also, you must provide the output from that tool, as an input for your R script.

You will submit both a PDF and an R script file or an RMarkdown file for your Assignment #5, using separate Dropbox folders on CourseLink. You should also include any needed input data files. You should show your code, not only the figures generated.

Please take care with the total length of your PDF. The expected assignment length is 8-12 pages (maximum length 18 pages). This longer maximum page limit compared to Assignments 1 and 2 is there to accommodate the additional text sections and reference list. Long code and lengthy outputs are NOT expected. For example, instead of lengthy data outputs to screen (and to your PDF), use head() to limit the scope of what is printed. As well, there are some functions that have quietly = TRUE (or similar) as an optional argument, which you may consider for functions that produce long outputs to screen (e.g. outputting progress during sequence alignment). You can also write your own functions to reduce redundancy in your code.

Organize your project using the numbered headings below, in order. Include the numbered headings in your assignment. You can add a more detailed sub-section title if you wish. E.g. “Introduction: Impact of Normalization Methods in Gene Expression Analysis”.
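As a small illustration of the advice above on keeping printed output short, the following is a minimal sketch only. It uses R's built-in iris data set as a stand-in for a larger project data set, and library(stats, quietly = TRUE) is just one example of a function that accepts a quietly argument; check the help page of whichever function you are using for the equivalent option.

```r
# Built-in example data set (150 rows), standing in for a larger project data set.
data(iris)

# Printing the whole data frame would add pages of output to the PDF.
# head() shows only the first few rows instead.
head(iris, n = 5)

# str() gives a compact, one-line-per-column overview of the same data.
str(iris)

# Some functions accept an argument that suppresses start-up or progress
# messages; library() is one example, via quietly = TRUE.
library(stats, quietly = TRUE)
```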

You should use GitHub to manage your own work. You may keep your repository private. Include your GitHub repository link at the top of your assignment.

You may heavily draw from a Bioconductor vignette for your project, if suitable. Of course, you must cite any such sources used. However, running an existing vignette, with the provided data, end to end would not be a sufficient project. You must build upon such tutorials and include a novel component for your project. Many examples are provided in file 1. Examples include: comparing the results from using two packages, testing a biological hypothesis using a different data set compared to what is used in the vignette, combining two software tools for a novel project (e.g. combining DADA2 output with a phylogenetic community analysis, such as using phyloseq or picante), trying a range of parameter settings for selected functions and exploring the impact of methodological choices upon the results, or re-creating all or part of an interesting analysis or figure from a publication using your own code.

Detailed Project Requirements and Sections

1. Introduction (short written section of 2-3 paragraphs) (10%)
Pose a question or objective that interests you, and set up your project. Using one of the themes outlined in file 1, or your own project idea (pending approval), narrow in on a specific project objective and data set. Your project may involve building a tool (e.g. classifier), exploring an idea using visualizations, testing a specific biological hypothesis by applying analytical methods, or exploring the impact of analytical choices upon results. Be clear in your introduction: Which of these types of projects are you conducting? And, what is the objective of your project? Aim to cite 2-4 suitable references in your introduction. You may use any of the references uploaded to CourseLink or other literature you find on your topic.

Guidelines: Please note that I would consider a “paragraph” to be approximately between one-quarter page and one-half page of text, using single spacing and 12-point font. Unless there is an egregious departure from this guideline, I am not planning on using a ruler to scrutinize the exact amount of your text. However, please stay within this guideline when considering overall Introduction length! A paragraph is not a one-page, single-spaced block of text. Please focus on the quality of your project and stick to the length guidelines.

This length guideline for written paragraphs applies to all of the written sections in Assignment #5.

Tips: The following is one possible way to organize your introduction into a nice flow.
o Paragraph 1: What is the overarching topic you are interested in? Why is this scientifically important and interesting and/or societally relevant?
o Paragraph 2: You can outline what is an important sub-area of research or a gap in knowledge. (As an example, perhaps your broader theme is gene expression analysis, and in paragraph 2 you could narrow in to introduce the importance of statistical choices.)
o Paragraph 3: What is the specific objective of your study? What kind of study are you performing (e.g. Exploratory? Software comparison? Hypothesis testing? Classifier creation? Learning about tools to re-create a published figure?) What will you do?

2. Description of Data Set (1 paragraph) (5%)
Provide a short written description of your data set. What data set have you chosen to analyze to address your question? Describe the data set. Where are the data from, e.g. from the literature? from a public database?

On what date did you obtain the data? How did you obtain the data? What are the properties of the data set? How many samples are there, and what are the key variables you are interested in? If you are comparing groups (e.g. treatments, organisms, cell types, genes), describe the nature of the groups that you are comparing. Cite the source of the data set.
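Several of the data set properties asked about above can be reported directly from R. The following is a minimal sketch under stated assumptions: the samples data frame, its column names, and its values are hypothetical and invented purely for illustration, and in a real project the data would be read from your downloaded file or database query instead.

```r
# Hypothetical example data, invented for illustration only. In a real
# project this would come from the downloaded file or database query.
samples <- data.frame(
  sample_id   = c("S1", "S2", "S3", "S4", "S5", "S6"),
  group       = c("control", "control", "control",
                  "treated", "treated", "treated"),
  body_mass_g = c(12.1, 11.8, NA, 14.3, 13.9, 15.0),
  stringsAsFactors = FALSE
)

# Record the date the data were obtained so it can be reported in the text.
date_obtained <- Sys.Date()

# Basic properties: total number of samples, number of samples per group,
# and number of missing values in the key variable of interest.
n_samples   <- nrow(samples)
n_per_group <- table(samples$group)
n_missing   <- sum(is.na(samples$body_mass_g))

n_samples
n_per_group
n_missing
```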

I expect that most students will analyze a real, publicly available biological data set for this assignment. If suitable for your project, it is permissible to simulate data for your study, in addition to using a biological data set. If you wish to consider data simulation, I suggest speaking with me in advance. Also, if suitable for your project, it is permissible to use a dataset provided by your advisor, to gain experience analyzing a type of data you will be working with in the future. (However, note you can’t copy/paste among courses or to your thesis, so speak with me if you have any doubt about what is acceptable.) If so, you need to explain the source of the data. Depending upon the size of the data set, you may consider analyzing only a subset here, to keep the project scope manageable. Data should be analyzable on a desktop computer for this assignment. As well, please note that you will need to provide a data file to enable your code to run.

If unpublished data are included with your assignment, I would treat such data as confidential, but you can ask your advisor to speak with me if they have any concerns. Please be sure to mention if you are using unpublished data in your assignment to ensure I am aware. (Students often bring to my attention interesting software tools or public vignettes that I may wish to check out, so I would want to know if there is anything confidential about your project.)

3. Code Section 1 – Data Acquisition, Exploration, Filtering, and Quality Control (25%)
Whenever possible, include code for data acquisition for your project. Also, include all data files needed for analysis with your submission. After data acquisition, you will next conduct data exploration, filtering, and quality control. What is suitable in this section will depend upon your project. Here are a few examples:

o For key quantitative variables (e.g. organism length, mass, etc.), calculate data summaries, such as using the summary() function. Also, prepare a suitable plot, such as a boxplot or histogram or violin plot, showing the distribution of the data. If suitable for your project, you may use multiple panels in a single figure if you have multiple variables of interest. Check on outliers. Are these likely real data points, or could they be data entry errors? Note that we are looking for errors, not looking to exclude data that disagree with a specific hypothesis. You may decide, for example, that even if there is an outlier that it is likely real data and you might perform a data transformation so that the observation doesn’t have extreme influence upon your results. Or, you may perform your analysis with vs. without extreme data points for comparison.

o For DNA sequence data, you may consider doing a check for sequence lengths and exclude sequences having very short length compared to the majority of your data, for example. The suitable choice would depend upon what gene and gene region you are analyzing. For example, for COI data for animals, there is only a very small amount of natural length variability (and thus very short sequences would be due either to poor quality or to metabarcoding projects which may involve sequencing short gene fragments). By contrast, in markers such as 18S, there can be a large amount of natural sequence length variability to consider. You may also consider excluding sequences with a large proportion of Ns. A suitable visualization for DNA sequence data could be a histogram of sequence lengths. If suitable for your project, you may investigate outliers based upon a pairwise distance matrix built either from k-mer frequencies or aligned sequences (i.e. are there sequences very different from others in the dataset?). Another helpful quality control step could be to build a visualization (such as a dendrogram, including scale bar of branch lengths) to look for sequences very different from others. You could then BLAST unusual sequences (using the NCBI tool) and check the top few hits and see if there could be a misidentification in the dataset (or another serious issue, such as the wrong gene, a different gene region being sequenced, wrong sequence orientation, wrong organism/contamination, etc.).

o For RNA-Seq data, you should examine library size among samples and make a suitable analytical choice that takes into account variability in library size. A suitable plot for that type of analysis could be to plot gene expression levels before vs. after normalization (e.g. see the gene expression vignette we ran together in class). You should also explicitly consider how you will choose to treat lowly-expressed genes for your project.
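For the DNA sequence example in the list above, the length and N-content checks could be sketched in base R roughly as follows. This is an illustration under stated assumptions, not a prescribed method: the seqs data frame and its sequences are made up, and the two thresholds (80% of the median length, at most 1% Ns) are arbitrary placeholders that you would need to justify for your own gene and gene region.

```r
# Hypothetical sequences, invented for illustration. Real data would come
# from a database such as BOLD or GenBank and carry real identifiers.
seqs <- data.frame(
  id = c("seq1", "seq2", "seq3", "seq4", "seq5"),
  nucleotides = c(
    strrep("ACGT", 160),                           # 640 bp, no Ns
    strrep("ACGT", 165),                           # 660 bp, no Ns
    paste0(strrep("ACGT", 150), strrep("N", 40)),  # 640 bp, 40 Ns
    strrep("ACGT", 30),                            # 120 bp, very short
    strrep("ACGT", 162)                            # 648 bp, no Ns
  ),
  stringsAsFactors = FALSE
)

# Sequence length, and the proportion of each sequence made up of Ns.
seqs$length <- nchar(seqs$nucleotides)
n_count <- seqs$length -
  nchar(gsub("N", "", seqs$nucleotides, fixed = TRUE))
seqs$prop_n <- n_count / seqs$length

# Histogram of sequence lengths before filtering, with units on the axis.
hist(seqs$length,
     main = "Sequence lengths before filtering",
     xlab = "Sequence length (bp)",
     ylab = "Number of sequences")

# Placeholder thresholds (assumptions; justify your own choices).
min_length <- 0.8 * median(seqs$length)
max_prop_n <- 0.01

# Keep sequences that pass both checks.
keep <- seqs$length >= min_length & seqs$prop_n <= max_prop_n
filtered <- seqs[keep, ]

# Sanity check: confirm the filter did what was intended.
stopifnot(all(filtered$length >= min_length),
          all(filtered$prop_n <= max_prop_n))

filtered[, c("id", "length", "prop_n")]
```

Keeping the id column attached to each sequence throughout makes it possible to report exactly which records were removed and why.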

The above are a few examples. In general, for your project, think (and read) about: What can go wrong with the type of data you are analyzing? And, what types of variability might be expected in that type of data? Also, make a deliberate decision about how you will treat missing data for your project. Will you focus on genes or traits with a good sample size, for example? Do you need complete cases (e.g. can you only include species having sequence data for each of two genes)? If suitable for your project, you may consider incorporating imputation techniques. Your data exploration and quality control section must include at least 1 figure and a maximum of 3 figures. Your assignment in its entirety must contain between 3 and 6 figures. Guidelines: Please comment your code well. You do not need to explain R syntax in your commenting. Rather, you should explain what you are doing and why at each step.

Why did you make a particular choice? Provide a brief justification for choices you have made for arguments to the functions you are using (whether you use the default or not).

Guidelines: Focus on quality of your code, not length. I would expect that, in most cases, 1-3 PDF pages worth of high-quality, commented code (plus figures) will suffice for each of the code sections. As well, please note that high-quality code is often more concise than early-draft code. For example, think about whether you can reduce redundancy by using an apply function, foreach(), a for loop or while loop, or writing your own function (as suitable).

Guidelines: Check your code for consistent and readable formatting.

Guidelines: Stick to the guideline of 1-3 figures for this section. Your grade will be based upon quality.
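To illustrate the point above about reducing redundancy, here is a minimal sketch in which one small function and vapply() replace three near-identical lines. The traits data frame, its values, and the coef_variation() helper are hypothetical and exist only for this example.

```r
# Hypothetical measurements for three traits, invented for illustration.
traits <- data.frame(
  length_mm = c(10.2, 11.5, 9.8, 12.1),
  mass_g    = c(1.9, 2.4, 1.7, 2.6),
  width_mm  = c(3.1, 3.4, 2.9, 3.6)
)

# Repetitive early-draft version (shown as comments only):
# cv_length <- sd(traits$length_mm) / mean(traits$length_mm)
# cv_mass   <- sd(traits$mass_g)    / mean(traits$mass_g)
# cv_width  <- sd(traits$width_mm)  / mean(traits$width_mm)

# Hypothetical helper: coefficient of variation of a numeric vector,
# ignoring missing values.
coef_variation <- function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

# Apply the helper to every column. vapply() also checks that each
# result is a single numeric value.
cv_by_trait <- vapply(traits, coef_variation, numeric(1))
cv_by_trait
```

If a fourth trait is added later, the concise version needs no changes, whereas the repetitive version needs another copied line that could introduce a typo.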

If you include more than 3 figures, only the first 3 will be considered during grading.

Tips: Remember to put “sanity checks” into your code. Are your data filtering steps and preliminary analysis steps doing what you want? Check that filtered data are as they should be. Take particular care when joining or sorting data to ensure that you don’t end up with incorrectly associated data. In general, I would highly recommend that you carry a unique identifier for each data point through every step of the analysis and that you join data based upon data values (rather than data order). An example of suitable identifiers from the sequence databases we have worked with would be the ProcessID from BOLD or the accession # from NCBI’s GenBank. Note that some alignment methods can end up reordering sequences compared to the input order; also, clustering and phylogenetic methods will reorder the data.

Tips: Please take note that the credit weighting for this section is the SAME as the credit weighting for your main analysis. This is because these steps are so important in an overall project! So, please ensure you give sufficient attention to this section.

4. Main Software Tools Description (1 paragraph) (5%)
Provide a short written description (1 paragraph) about the main software tool you will be using to answer your main question. Why did you make this choice? What are the expected strengths and weaknesses of the tool you chose?

Did you consider any alternatives? Cite the authors of the tool you used (you can cite the package itself plus the relevant associated publication, if available). If you are conducting a methodological project involving a comparison of tools, you might briefly describe two main tools in this section. Otherwise, you will typically describe one main software tool. Also, in this section, make it clear how you built upon existing vignettes for conducting your project.

5. Code Section 2 – Main Analysis (25%)
You would then move on to performing your main analysis. What would go in this section would depend upon your project, e.g.: writing code to build and test a classifier, build visualizations to explore ideas, conduct a statistical test of a biological hypothesis, or explore the impact of methodological choices upon results. Your main analysis section must also include a minimum of 1 figure and a maximum of 3 figures. The grade is based upon quality.

Your assignment as a whole must contain between 3 and 6 figures. Please note that extra figures will not be graded.

Guidelines: Please see the guidelines for code section #1 above. The same general messages apply. Focus on quality of your code, not length, and comment your code well, etc.

6. Quality of Visualizations (20%)
Throughout, ensure that your figures are clear and well labeled. Even for simple figures, such as histograms, ensure that you have accurate, informative axis labels. Include units of measurement in your axis labels. Also, consider readability, visual appeal, and accessibility. Use well-differentiated colours, and avoid relying upon the red-green spectrum to convey scientifically important information. Remember, you can consider using a combination of colour and symbol/pattern to convey your meaning. The grade in this section is based upon quality and novelty, not having the maximum permissible number of figures. You should have a total of 3-6 figures for your project overall.

7. Results and Discussion (short written section of 2-3 paragraphs) (10%)
At the end, write a short section describing and interpreting what you discovered through conducting your project. Please find here an example outline that would provide a nice flow for this section:
o Paragraph 1: Return to your original question. What is the answer to your question? What did you discover? Were your results as expected or not?
o Paragraph 2: Briefly describe any key caveats of your study. For example, are the conclusions that can be drawn limited by sample size or any other concerns? Were there biases in data availability that could have impacted your project?
o Paragraph 3: What would be the next steps for this research? What would you do next if you had more time and if you were going to develop this work into a larger project? Did your results reveal any interesting preliminary findings that would be worthy of follow-up study?

Guidelines: Please see introduction for length guidelines for written paragraphs.

What did you learn through completing your small project and this course? What lessons will you take forward in your future coursework, BINF*6999 or thesis, and future career? What skills do you want to work on next to help you to meet your graduate program and career goals? You may consider reflecting upon your progress with technical skills as well as personal strategies, time management, etc.

9. Acknowledgements (section expected to be present for academic integrity)
If you received project tips from others, include that information in this section. Briefly, indicate who you talked to, the nature of the advice, and how this impacted your project. You may speak to other class members and the course instructors. It is NOT permitted to complete the assignment for someone else or to copy/paste blocks of code from others. If someone helped you to get unstuck when you were facing an error message, then indicate who and what you learned from this. To clarify: You ARE allowed to talk to others, but you are NOT permitted to let them fix the problem for you without you actually understanding what is going on. Write about: What was causing the error message, and what did you learn during the process of obtaining help to fix it? Throughout this assignment, you may look at online resources but are responsible for citing these and writing the code yourself. Also, you need to write all commenting and text sections in your own words, not using generative AI.

10. References (grade credit for this section is incorporated into the Intro and Discussion grades)
If you cite any sources from the scientific literature, include them here in your reference list.

You may use any of the papers posted to CourseLink or other literature relevant for your project. Additional citations for vignettes, other tutorials, StackOverflow posts, etc., are not included in the 10-reference limit. You must cite all such sources. You would include scientific references as an in-text citation in the relevant sentence of your assignment, i.e. introduction or discussion (e.g. Xu et al. 2020). Also, list the full reference here at the end. If you used any specific online tutorials or a specific StackOverflow posting, for example, you must also include those here. Include the URL and the date accessed. You should choose a consistent format for your reference list.

hypothesis that could be tested in the future?)
*refers to the literature to aid interpretation and discussion of next steps (e.g. 2-4 highly relevant references cited)
*excellent quality writing, flows well, no or very few errors
*Acknowledgements section is expected to be present for academic integrity, if you talked to anyone else about your assignment.
*Reference list must also be included. You should cite your main software tools used, any scientific literature cited (3-6 papers recommended, 10 maximum), any vignettes that you consulted, online tutorials, specific StackOverflow postings, etc. Grade credit for the quality and relevance of the reference list is included in the Introduction and Discussion Grades.

Total: Graded out of 100%
Valued at 30% of Course Grade
