# Quiz 6#

BEFORE YOU START THIS QUIZ:

1. Click on “Copy to Drive” to make a copy of the quiz,

2. Click on “Share”,

3. Click on “Change” and select “Anyone with this link can edit”

5. Paste the link into this Canvas assignment.

This quiz is open notes, open internet.

• You can ask for help from the instructor, but not from anyone else.

• You can use code you find on the internet, but if you use more than a couple of lines from a single source, you should attribute the source.

## A tree of sequences#

Suppose you have a large number of DNA sequences, represented using strings containing the characters A, C, G, and T, which are the bases that make up DNA.

One way to store these sequences is in a tree where each node has four children, one for each base. Here’s a class definition for such a node.

class Node:
def __init__(self):
self.children = [None, None, None, None]
self.end = False

def __repr__(self):
return f'Node({self.end}, {self.children})'


The instance variables are:

• children, which is a list of four Node objects, where None indicates that a child is missing.

• end: which is a bool that indicates whether this Node represents the end of a sequence.

I’ll use the following dictionary to map from each base to its index in the list of children.

index_map = dict(A=0, C=1, G=2, T=3)
index_map

{'A': 0, 'C': 1, 'G': 2, 'T': 3}


The following function inserts a new sequence into the tree.

def insert(root, sequence):
node = root
for base in sequence:
i = index_map[base]
if node.children[i] is None:
node.children[i] = Node()
node = node.children[i]
node.end = True


As an example, here’s a tree that contains only one sequence, CA.

node = Node()
insert(node, 'CA')
node

Node(False, [None, Node(False, [Node(True, [None, None, None, None]), None, None, None]), None, None])


The root of the tree has a single child, at index 1.

child = node.children[1]
child

Node(False, [Node(True, [None, None, None, None]), None, None, None])


The child has a single child, at index 0.

grandchild = child.children[0]
grandchild

Node(True, [None, None, None, None])


In the grandchild, end is True, which indicates that there is a sequence in the tree that ends at this node.

Here’s an example with more sequences.

tree = Node()
for sequence in ['ACGT', 'ACAT', 'CAT', 'CATATTAC']:
insert(tree, sequence)


I’ll use NetworkX and EoN to draw this tree.

try:
import EoN
except ImportError:
!pip install EoN

import networkx as nx

"""Traverse the tree and add edges to G."""
for child in parent.children:
if child:

G = nx.DiGraph()

def get_labels(parent, labels):
"""Traverse the tree and add node labels to a dictionary."""
if parent.end:
labels[parent] = '*'
else:
labels[parent] = ''

for child in parent.children:
if child:
get_labels(child, labels)

def get_edge_labels(parent, edge_labels):
"""Traverse the tree and add edge labels to a dictionary."""
bases = 'ACGT'
for i, child in enumerate(parent.children):
if child:
edge_labels[parent, child] = bases[i]
get_edge_labels(child, edge_labels)

from EoN import hierarchy_pos

def draw_tree(tree):
G = nx.DiGraph()
pos = hierarchy_pos(G)
labels = {}
get_labels(tree, labels)
edge_labels = {}
get_edge_labels(tree, edge_labels)
nx.draw(G, pos, labels=labels, alpha=0.4)
nx.draw_networkx_edge_labels(G, pos,
edge_labels=edge_labels,
font_color='C1')

draw_tree(tree)


## Question 1#

Write a function called find that takes as parameters

• A Node that represents the root of a tree and

• A string representing a sequence of bases.

It should return True if the sequence appears in the tree, and False otherwise.

You can use the following examples to test your code:

find(tree, 'CAT')  # should be True

True

find(tree, 'ACAT') # should be True

True

find(tree, 'TAG') # should be False

False

find(tree, 'CA') # should be False

False


## Question 2#

Write a function called find_all_rec that takes as parameters:

• A Node in a tree.

• A path that indicates the sequence of bases from the root to the current Node.

• A list of sequences.

This function should traverse the tree and add to the list all of the complete sequences it discovers.

Hint: Review make_table from huffman.ipynb.

You can use the following example to test your code.

t = []
find_all_rec(tree, '', t)
t

['ACAT', 'ACGT', 'CAT', 'CATATTAC']


The result should be a list with the following elements, not necessarily in this order

['ACAT', 'ACGT', 'CAT', 'CATATTAC']


## Question 3#

Write a function called find_all that takes a Node and a sequence of bases. It should traverse the tree and return a list that contains all sequences in the tree that begin with the given prefix.

Note: You can use find_all_rec as part of your solution even if your answer to the previous question does not work.

You can use the following examples to test your code.

find_all(tree, 'CA')   # Should return ['CAT', 'CATATTAC']

['CAT', 'CATATTAC']

find_all(tree, 'A')    # Should return ['ACAT', 'ACGT']

['ACAT', 'ACGT']

find_all(tree, '')     # Should return all sequences in the tree

['ACAT', 'ACGT', 'CAT', 'CATATTAC']


## Question 4#

Suppose we write a more general version of Node.__init__ that takes end and children as optional parameters.

class BadNode:
def __init__(self, end=True, children=[None, None, None, None]):
self.children = children
self.end = end

def __repr__(self):
return f'Node({self.end}, {self.children})'


And we write a version of insert that uses BadNode:

def bad_insert(root, sequence):
node = root
for base in sequence:
i = index_map[base]
if node.children[i] is None:
node = node.children[i]
node.end = True


If we use the new versions to make a tree, like this:

tree2 = BadNode()
for sequence in ['ACGT', 'ACAT', 'CAT', 'CATATTAC']:


It seems to work. But if we draw the tree, we get a RecursionError.

draw_tree(tree2)

---------------------------------------------------------------------------
RecursionError                            Traceback (most recent call last)
<ipython-input-67-f08ae257619c> in <module>
----> 1 draw_tree(tree2)

<ipython-input-50-bd629f554f38> in draw_tree(tree)
3 def draw_tree(tree):
4     G = nx.DiGraph()
6     pos = hierarchy_pos(G)
7     labels = {}

6         if child:

... last 1 frames repeated, from the frame below ...

6         if child:

In the cell that defines BadNode, write a comment that explains what the problem is, and then fix it.
Note: Your fixed version should still accept end and children as optional parameters.