Bioinformatics assignment: motif finding and transcription factor binding site prediction

Name____________________________________

The chart below is a position-specific scoring matrix (PSSM, a log-transformed matrix) for a transcription factor binding site. (1) Evaluate the sequence “GACATTCA” to find out which segment of the sequence fits the binding site best. (2) What is the maximum score that a sequence can have with this PSSM? (3) What is the minimum score that a sequence can have with this PSSM?
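The scoring procedure behind these three questions can be sketched in Python. This is only an illustration: the matrix values, the four-column width, and the names PLACEHOLDER_PSSM, window_scores, and score_bounds are made up for the example and are not the chart from this assignment. Substitute the real chart values before drawing any conclusion.

```python
# HYPOTHETICAL placeholder matrix, NOT the chart from the assignment.
# Rows are bases, columns are motif positions (log-transformed scores).
PLACEHOLDER_PSSM = {
    "A": [1.0, -2.0, 0.5, -1.0],
    "C": [-1.0, 1.5, -0.5, -2.0],
    "G": [-2.0, -1.0, 1.0, 0.5],
    "T": [0.5, -0.5, -2.0, 1.5],
}


def window_scores(pssm, seq):
    """Slide the matrix along seq and score every full-width window."""
    width = len(next(iter(pssm.values())))
    results = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Log scores are additive: sum one entry per motif position.
        total = sum(pssm[base][pos] for pos, base in enumerate(window))
        results.append((start, window, total))
    return results


def score_bounds(pssm):
    """Return (maximum, minimum) score any sequence can reach."""
    columns = list(zip(*pssm.values()))  # one tuple of four scores per position
    best = sum(max(col) for col in columns)
    worst = sum(min(col) for col in columns)
    return best, worst


for start, window, total in window_scores(PLACEHOLDER_PSSM, "GACATTCA"):
    print(start, window, total)
print("max/min possible:", score_bounds(PLACEHOLDER_PSSM))
```

The maximum and minimum come from picking the largest and smallest entry in each column independently, because each position contributes to the sum on its own.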

Many programs for transcription factor binding site prediction are freely available on the internet. Find a few of them and list the links to at least two websites. Test them with the following sequence to see how they work, and briefly describe your test results.

ATCAGGCCCCCAAGGAAGATTTGAAGGGGGCGACTGCATAGTCGGGGTAT

CTCCCATATACCCAAGGAGATGGGTTTCTCTAACAGCACCCACGTCAGTC

CTAAATCTTCTTACTCTCCTGGATTAGAAGACTGTGTTTCCCAGGCCACA

TCTGAGAAGCCTGAGCTCCTTAGCCCTGAAATAGCAGAGTGCTGACAAGA

CACAGGGGCCTAGGGGCTCTGGAGTCCAAGGGGAGTCCTCAGCAGAAGAC

ACATAGGAGGCATTCTTTGTTGGGGCTGGCTTTTCTGTTTGCAAAGCCTG

CTTGAAATATCCTGCCCTTTCTATGGACACTTTCCTTAGGATATAACCTA

ATCTGTGGTTAATCACTATTCTT

The following is the alignment of a putative transcription factor binding site from various genes.

Position 123456

AGAACC

ACAAAG

CGAGGA

TCAAGT

AACAGA

AGATGA

GGAAGA

AGTAGA

ATGCTA

AGTAGA

Based on the alignment, (1) generate a position-specific scoring matrix (PSSM); (2) show which DNA sequence would get the highest score (the highest probability of fitting the model); (3) indicate a DNA sequence that would be scored the lowest (the lowest probability of fitting this matrix); and (4) determine which substring of the sequence “AACCGTAAC” is the most likely binding site for this transcription factor, if there is one, and which substring is the least likely binding site.
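One possible way to check hand calculations for this question is the Python sketch below. It makes several assumptions that the assignment does not specify: a pseudocount of 1 per base, a uniform background frequency of 0.25, and log base 2. Your course may use a different convention, which changes the numbers but not the procedure. The function names are chosen for the example only.

```python
import math

# The ten aligned sites listed in the assignment.
SITES = ["AGAACC", "ACAAAG", "CGAGGA", "TCAAGT", "AACAGA",
         "AGATGA", "GGAAGA", "AGTAGA", "ATGCTA", "AGTAGA"]
BASES = "ACGT"


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Return one {base: log-odds score} dict per alignment column.

    ASSUMPTION: pseudocount, background, and log2 are illustrative choices.
    """
    total = len(sites) + pseudocount * len(BASES)
    pssm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        scores = {}
        for base in BASES:
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score(pssm, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(col[base] for col, base in zip(pssm, window))


def scan(pssm, seq):
    """Score every window of motif width in seq."""
    width = len(pssm)
    return [(seq[i:i + width], score(pssm, seq[i:i + width]))
            for i in range(len(seq) - width + 1)]


def extreme_sequence(pssm, pick):
    """pick=max gives a top-scoring sequence, pick=min a bottom-scoring one.

    When several bases tie in a column, only one of them is reported.
    """
    return "".join(pick(col, key=col.get) for col in pssm)


pssm = build_pssm(SITES)
print("highest-scoring sequence:", extreme_sequence(pssm, max))
print("a lowest-scoring sequence:", extreme_sequence(pssm, min))
for window, value in scan(pssm, "AACCGTAAC"):
    print(window, round(value, 3))
```

Several columns have bases with equal counts, so more than one sequence can share the lowest score; the sketch reports just one of them.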

DNA motif search

DNA motifs are normally very subtle and are typically detected using “alignment-independent” methods such as expectation maximization (EM) and Gibbs motif sampling.

a. Use the DNA sequences at the bottom of the file to generate an alignment with the EM-based program Improbizer (www.cse.ucsc.edu/~kent/improbizer/improbizer.html) using default parameters. Copy and paste your results into your report file.

b. Do the same search using the Gibbs sampling-based program AlignACE (http://www1.spms.ntu.edu.sg/~chenxin/W-AlignACE/). Copy and paste the sequences below into the correct input box. Change the number in the box following Number of Column to Align to 6, and change the value in the box following Number of Sites Expected to 3. Copy and paste your results into your report file.

c. Compare the best-scoring motifs from both methods. Are there overlaps?

d. Copy and paste the first motif derived from AlignACE into nedit. Remove the illegal characters (spaces and numbers).
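If you prefer to script the cleanup in step d rather than edit by hand, a minimal Python sketch follows. It assumes each pasted line holds one aligned site plus some spaces and numeric columns; the sample text and the function name clean_motif_lines are hypothetical and are not real program output.

```python
import re


def clean_motif_lines(text):
    """Strip whitespace and digits from each line, dropping empty lines."""
    cleaned = []
    for line in text.splitlines():
        letters = re.sub(r"[\s\d]", "", line)
        if letters:
            cleaned.append(letters)
    return cleaned


# HYPOTHETICAL sample, only to show the shape of the input.
sample = "ACGTAC  0  12 1\n\nTTGACA 3 7 0\n"
print("\n".join(clean_motif_lines(sample)))
```

Check that every cleaned line has the same length before pasting the result into a logo program, since an alignment needs equal-width rows.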

e. Cut and paste the motif alignment into the WebLogo program (http://weblogo.berkeley.edu/logo.cgi). Click the “Create logo” button. Copy your result by holding Ctrl and pressing Print Screen at the same time. Paste your result into your report file.

f. Does the program find the highlighted region in the sequence?

Modified by Xiaofei Wang, 2017

Updated 2020

>Seq1

CACATCCCACCACAACCTTCCAGCAGCACGTGCAGGAACAGACAGGGGAA

TGGACGTAAGCGGCTCCTTAATATAATGTTGGGTCGTCGTAGGGATACCT

AGAAAGGTGTCCTGATATTAACCAC

>Seq 2

GAGCTAACATCAAAGCAGCACGTTTCCTAACTAAGACTACACATTTTCCA

TCTCACGTGCACAACTGAGTCCCCACTAGGACACTTTACAGACATTTGGA

>Seq 3

AAAGTGATGATCCTTCCTTTCCCTCCTAGATTAAATACTCATGTCCCACG

TGTACATCAGACTCAGCGCTGCTCGTAGCTGGAAACAAGATGGTGAAACT

>Seq 4

AGATCTGAATAATGAAGTAAGTTGTTCCCTTACACATGCAGCAGAAACTG

CCATTGCCTTCAAGAGCTGCAGAATAACACACGTGTGCTGTTCTGCGGGG

>Seq 5

TCAAGACCACGTGAAAGGCCGAGGTGGGTGGATCACTTGAGGTCAGGAAT

CAGCCAGGCCAACACGGCAAAAGCCTGTCTCTACAAAAAATACAAAAAAT

TAGCAGGGGATGGTGGTGTGTGTCTGTAGTCCCAGCTATTGCAGTGAGCA

>Seq 6

CAAGCAGGCTTAAACAAAATTCAATATCTGGACACATTGTAGTTAAACCA

CGTGACACTGTTATCACTGTCACACACATCTGTGTGAAGAGACCACCAAA

ACCTAGTAGATCGTA

>Seq 7

TAGGCTTCATGTGAGCAATAAAGCTTTTTAATCACCTGGGTGCACGTGGG

CTGAGTCCAAAAAAGGAGTCAGCAAAGGGTGGTAGGATTATCATTAGTTC

TTGAGATCCGATCAAATGCTATCCCCGTTATHAY

>Seq8

CACACATACACACACCAGACACACACCACACGTGCATACACAGACACACA

CCACACGCACTCGCTCGCGCGCACACACACACACACTTTTTATATACAAA

>Seq 9

CCAAATTCAGAAAACATCACGTGGCTTTTTACAATGTTTTCAGCAGCATA

GAACTTTTGCTGCAATGTCGTCGTATATGTTCCCTAGGATATAGTCTCAATCT

TGGTATTGTAGCTGATAGTCTGTAAGGGTTTCCCCCAGTAACT