Bioinformatics assignment: motif finding and transcription factor binding site prediction
Name____________________________________
The chart below is a position-specific scoring matrix (PSSM, a log-transformed matrix) for a transcription factor binding site. (1) Evaluate the sequence “GACATTCA” to find which segment of the sequence fits the binding site best. (2) What is the maximum score that a sequence can have with this PSSM? (3) What is the minimum score that a sequence can have with this PSSM?
Many programs for transcription factor binding site prediction are freely available on the internet. Find a few of them and list links to at least two websites. Test them with the following sequence to see how they work, and briefly describe your test results.
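All three questions come down to simple sums: a segment's score is the sum of the matrix entries for its bases, the maximum possible score is the sum of the column maxima, and the minimum is the sum of the column minima. A minimal Python sketch, assuming a made-up 4-column log-odds matrix (`HYPOTHETICAL_PSSM`; its width and values are illustrative only and are not the chart from this assignment):

```python
# Hypothetical 4-column log-odds matrix; NOT the assignment's chart.
HYPOTHETICAL_PSSM = {
    "A": [1.0, -2.0, 0.5, -1.0],
    "C": [-1.0, 1.5, -0.5, 0.0],
    "G": [0.0, -1.0, -2.0, 1.0],
    "T": [-2.0, 0.0, 1.0, -0.5],
}

def score_windows(seq, pssm):
    """Return (start, segment, score) for every window as wide as the matrix."""
    width = len(next(iter(pssm.values())))
    results = []
    for start in range(len(seq) - width + 1):
        segment = seq[start:start + width]
        # Log-odds scores add across positions.
        score = sum(pssm[base][i] for i, base in enumerate(segment))
        results.append((start, segment, score))
    return results

def score_bounds(pssm):
    """Maximum and minimum total score: sums of the column maxima and minima."""
    columns = list(zip(*pssm.values()))
    return sum(max(c) for c in columns), sum(min(c) for c in columns)

windows = score_windows("GACATTCA", HYPOTHETICAL_PSSM)
best = max(windows, key=lambda w: w[2])
max_score, min_score = score_bounds(HYPOTHETICAL_PSSM)
print(best, max_score, min_score)
```

Substituting the chart's values for the hypothetical ones gives the answers for the real matrix.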
ATCAGGCCCCCAAGGAAGATTTGAAGGGGGCGACTGCATAGTCGGGGTAT
CTCCCATATACCCAAGGAGATGGGTTTCTCTAACAGCACCCACGTCAGTC
CTAAATCTTCTTACTCTCCTGGATTAGAAGACTGTGTTTCCCAGGCCACA
TCTGAGAAGCCTGAGCTCCTTAGCCCTGAAATAGCAGAGTGCTGACAAGA
CACAGGGGCCTAGGGGCTCTGGAGTCCAAGGGGAGTCCTCAGCAGAAGAC
ACATAGGAGGCATTCTTTGTTGGGGCTGGCTTTTCTGTTTGCAAAGCCTG
CTTGAAATATCCTGCCCTTTCTATGGACACTTTCCTTAGGATATAACCTA
ATCTGTGGTTAATCACTATTCTT
The following is the alignment of a putative transcription factor binding site from various genes.
Position 123456
AGAACC
ACAAAG
CGAGGA
TCAAGT
AACAGA
AGATGA
GGAAGA
AGTAGA
ATGCTA
AGTAGA
Based on the alignment, (1) generate a position-specific scoring matrix (PSSM); (2) show which DNA sequence would get the highest score (that is, the highest probability of fitting the model); (3) indicate a DNA sequence that would get the lowest score (fitting this matrix with the lowest probability); and (4) determine which substring of the sequence “AACCGTAAC” is the most likely binding site for this transcription factor, if there is one, and which substring is the least likely binding site.
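One way to check a hand-built matrix is to compute it from the alignment. The sketch below is only one possible convention, assuming a uniform background frequency of 0.25 per base, base-2 logarithms, and no pseudocount by default (every base occurs at least once in every column of this alignment, so no count is zero). The function names are illustrative.

```python
import math

ALIGNMENT = [
    "AGAACC", "ACAAAG", "CGAGGA", "TCAAGT", "AACAGA",
    "AGATGA", "GGAAGA", "AGTAGA", "ATGCTA", "AGTAGA",
]
BASES = "ACGT"

def build_pssm(sites, background=0.25, pseudocount=0.0):
    """Log2-odds matrix: pssm[base][i] = log2(frequency at column i / background)."""
    n = len(sites)
    width = len(sites[0])
    pssm = {b: [] for b in BASES}
    for i in range(width):
        column = [s[i] for s in sites]
        for b in BASES:
            # With pseudocount=0 a zero count would make log2 fail;
            # pass a positive pseudocount for alignments that have one.
            freq = (column.count(b) + pseudocount) / (n + 4 * pseudocount)
            pssm[b].append(math.log2(freq / background))
    return pssm

def extreme_sequence(pssm, pick):
    """Pick the max- or min-scoring base in each column (pick is max or min)."""
    width = len(pssm["A"])
    return "".join(pick(BASES, key=lambda b: pssm[b][i]) for i in range(width))

def score(segment, pssm):
    """Sum the log-odds entries for one segment as wide as the matrix."""
    return sum(pssm[b][i] for i, b in enumerate(segment))

pssm = build_pssm(ALIGNMENT)
width = len(ALIGNMENT[0])
query = "AACCGTAAC"
scored = [(query[s:s + width], score(query[s:s + width], pssm))
          for s in range(len(query) - width + 1)]
best_fit = max(scored, key=lambda x: x[1])
worst_fit = min(scored, key=lambda x: x[1])
highest = extreme_sequence(pssm, max)
lowest = extreme_sequence(pssm, min)
print(highest, lowest, best_fit, worst_fit)
```

Where several bases share a column's lowest count, `min` returns the first one in `ACGT` order, so the lowest-scoring sequence it reports is one of several equally low ones.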
DNA motif search
DNA motifs are often very subtle, so they are usually detected with “alignment-independent” methods such as expectation maximization (EM) and Gibbs motif sampling.
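To make the Gibbs sampling idea concrete, here is a toy site sampler. It is a simplified sketch and not the implementation used by AlignACE or any other program: it assumes exactly one site per sequence, a uniform background of 0.25 per base, and input containing only A, C, G, and T. The function name and parameters are illustrative.

```python
import math
import random

def gibbs_motif_search(seqs, width, iterations=200, seed=0, pseudocount=1.0):
    """Toy Gibbs motif sampler: returns one motif start position per sequence."""
    rng = random.Random(seed)
    # Start from a random site in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for held_out, seq in enumerate(seqs):
            # Build column-wise counts from the sites of all other sequences.
            counts = [{b: pseudocount for b in "ACGT"} for _ in range(width)]
            for j, other in enumerate(seqs):
                if j == held_out:
                    continue
                site = other[starts[j]:starts[j] + width]
                for i, base in enumerate(site):
                    counts[i][base] += 1
            total = (len(seqs) - 1) + 4 * pseudocount
            # Weight each window of the held-out sequence by its likelihood
            # ratio against the uniform background.
            weights = []
            for start in range(len(seq) - width + 1):
                window = seq[start:start + width]
                log_ratio = sum(math.log(counts[i][b] / total / 0.25)
                                for i, b in enumerate(window))
                weights.append(math.exp(log_ratio))
            # Sample the new start position in proportion to those weights.
            starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Made-up toy input with the same 6-mer planted in each sequence.
toy = ["TTTTCACGTGTTTT", "AAACACGTGAAAAA", "GGGGGGCACGTGGG"]
starts = gibbs_motif_search(toy, width=6, iterations=20, seed=1)
print([seq[s:s + 6] for s, seq in zip(starts, toy)])
```

Because the sampler is stochastic, different seeds can settle on different sites; real programs typically run many restarts and report the best-scoring motif.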
a. Use the DNA sequences at the bottom of the file to generate an alignment with the EM-based program Improbizer (www.cse.ucsc.edu/~kent/improbizer/improbizer.html), using the default parameters. Copy and paste your results into your report file.
b. Run the same search with the Gibbs sampling-based algorithm AlignACE (http://www1.spms.ntu.edu.sg/~chenxin/W-AlignACE/). Copy and paste the sequences below into the correct input box. Set the value in the box labeled Number of Column to Align to 6, and set the value in the box labeled Number of Sites Expected to 3. Copy and paste your results into your report file.
c. Compare the best-scoring motifs from the two methods. Do they overlap?
d. Copy and paste the first motif derived from AlignACE into nedit. Remove the illegal characters (spaces and numbers).
e. Cut and paste the motif alignment into the WebLogo program (http://weblogo.berkeley.edu/logo.cgi) and click the “Create logo” button. Capture your result by holding Ctrl and pressing Print Screen at the same time, then paste it into your report file.
f. Does the program find the highlighted region in the sequence?
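The clean-up in step d (removing spaces and numbers) can also be scripted instead of done by hand in an editor. A minimal sketch, assuming each site sits on its own line; the `example` text is made up for illustration and is not real program output:

```python
import re

def clean_motif_lines(raw_text):
    """Drop digits and whitespace from each line; skip lines left empty."""
    cleaned = []
    for line in raw_text.splitlines():
        stripped = re.sub(r"[\s\d]+", "", line)
        if stripped:
            cleaned.append(stripped)
    return cleaned

# Made-up input in the spirit of step d; not real AlignACE output.
example = "CACGTG 3 45 1\nCACGTG 1 27 1\n\nCACGTT 7 44 0\n"
print("\n".join(clean_motif_lines(example)))
```

The printed lines contain only sequence characters and can be pasted into the logo input box.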
Modified by Xiaofei Wang, 2017
Updated 2020
>Seq1
CACATCCCACCACAACCTTCCAGCAGCACGTGCAGGAACAGACAGGGGAA
TGGACGTAAGCGGCTCCTTAATATAATGTTGGGTCGTCGTAGGGATACCT
AGAAAGGTGTCCTGATATTAACCAC
>Seq 2
GAGCTAACATCAAAGCAGCACGTTTCCTAACTAAGACTACACATTTTCCA
TCTCACGTGCACAACTGAGTCCCCACTAGGACACTTTACAGACATTTGGA
>Seq 3
AAAGTGATGATCCTTCCTTTCCCTCCTAGATTAAATACTCATGTCCCACG
TGTACATCAGACTCAGCGCTGCTCGTAGCTGGAAACAAGATGGTGAAACT
>Seq 4
AGATCTGAATAATGAAGTAAGTTGTTCCCTTACACATGCAGCAGAAACTG
CCATTGCCTTCAAGAGCTGCAGAATAACACACGTGTGCTGTTCTGCGGGG
>Seq 5
TCAAGACCACGTGAAAGGCCGAGGTGGGTGGATCACTTGAGGTCAGGAAT
CAGCCAGGCCAACACGGCAAAAGCCTGTCTCTACAAAAAATACAAAAAAT
TAGCAGGGGATGGTGGTGTGTGTCTGTAGTCCCAGCTATTGCAGTGAGCA
>Seq 6
CAAGCAGGCTTAAACAAAATTCAATATCTGGACACATTGTAGTTAAACCA
CGTGACACTGTTATCACTGTCACACACATCTGTGTGAAGAGACCACCAAA
ACCTAGTAGATCGTA
>Seq 7
TAGGCTTCATGTGAGCAATAAAGCTTTTTAATCACCTGGGTGCACGTGGG
CTGAGTCCAAAAAAGGAGTCAGCAAAGGGTGGTAGGATTATCATTAGTTC
TTGAGATCCGATCAAATGCTATCCCCGTTATHAY
>Seq8
CACACATACACACACCAGACACACACCACACGTGCATACACAGACACACA
CCACACGCACTCGCTCGCGCGCACACACACACACACTTTTTATATACAAA
>Seq 9
CCAAATTCAGAAAACATCACGTGGCTTTTTACAATGTTTTCAGCAGCATA
GAACTTTTGCTGCAATGTCGTCGTATATGTTCCCTAGGATATAGTCTCAATCT
TGGTATTGTAGCTGATAGTCTGTAAGGGTTTCCCCCAGTAACT