To start many of the processes in our cell nuclei, proteins need to bind to specific sequences on our DNA. For example, the gene-editing tool CRISPR/Cas9 needs to locate a small target sequence, and a regulatory protein needs to find the particular gene it controls. How do these molecules find the right spot on the DNA? The base-pair sequences they must bind are short: hundreds of millions of times shorter than our DNA. These proteins face a needle-in-a-haystack problem.
Despite so many possible targets, protein search times are short. For example, the Lac repressor in E. coli bacteria needs 1 to 5 minutes to find its target site. That is twice as fast as a random search by 3D diffusion through the bacterium’s volume. In other experiments, outside the cell, researchers observed that proteins home in on targets faster if the DNA is coiled rather than straight. These observations suggest that proteins find their targets by means other than 3D diffusion alone, and that DNA’s folding plays a role.
However, until recently, we could not model protein target search in human cells because we did not have genome-wide data on DNA’s 3D organization. This changed with the Hi-C method. Hi-C detects the number of physical contacts between all pairs of DNA fragments, which can be as small as 1,000 base pairs, across the genome. With data on which DNA fragments are closest to each other, we can now ask how our DNA is organized and how that organization explains protein target-search times.
Building on these ideas, we made a model for protein target search in human cells. In the model, we use Hi-C data as a proxy for DNA’s 3D structure. From these data, we construct a network where every node represents a DNA segment, and the strength of the link between two nodes is proportional to the number of contacts detected between them in Hi-C. We then model the searching proteins as particles that move randomly over the network until they find a target. This model allows us to calculate the average time needed to find every node.
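To make the idea concrete, here is a minimal sketch of how average search times could be computed on such a network. It is an illustration under stated assumptions, not the model used in the study: the function name `mean_search_times`, the discrete-time random walk, the uniformly random starting node, and the three-node toy contact matrix are all choices made for this example.

```python
import numpy as np

def mean_search_times(contacts):
    """Average number of random-walk steps needed to reach each node.

    contacts: symmetric matrix of contact counts between DNA segments
    (a stand-in for Hi-C data). Assumes the network is connected and
    that the walker starts at a uniformly random non-target node.
    """
    W = np.asarray(contacts, dtype=float)
    n = W.shape[0]
    # A walker at node i hops to node j with probability proportional
    # to the number of contacts between i and j.
    P = W / W.sum(axis=1, keepdims=True)
    times = np.empty(n)
    for target in range(n):
        keep = np.arange(n) != target
        Q = P[np.ix_(keep, keep)]  # hops that avoid the target
        # Mean first-passage times m satisfy (I - Q) m = 1.
        m = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))
        times[target] = m.mean()
    return times

# Toy example: three segments in a row; the middle one touches both ends.
toy_contacts = np.array([[0, 1, 0],
                         [1, 0, 1],
                         [0, 1, 0]])
toy_times = mean_search_times(toy_contacts)
print(toy_times)  # [3.5 1.  3.5]
```

In this toy network the well-connected middle segment is found in 1 step on average, while the two end segments take 3.5 steps: nodes with more contacts are easier to find.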
Then we asked: what characterizes regions that are easier to find than others? Using genetic data from ENCODE, we found that gene-rich regions are easier to find than gene-poor ones, and that active genes are easier to find than inactive ones.
The framework that we developed for this study offers a new way to interpret protein-binding data that can’t be explained by the DNA sequence alone. The physical organization of the DNA matters in gene regulation.