The Numerical Solution of Itô Stochastic Differential Equations
Definitions and Formal Solutions
Consider a stochastic system represented by a state vector of N real-valued variables, X = {X1, X2, ..., XN}, that is influenced by one or more continuous-valued random processes. By the Central Limit Theorem, the Wiener process, W(t), serves as an effective model for such continuous-valued random processes. For each random process impacting the system, an additional Wiener process is incorporated, allowing the time evolution of a wide variety of systems to be modeled by a Wiener-driven vector stochastic differential equation.
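As a concrete illustration (not part of the original development), the following Python sketch simulates Wiener sample paths by accumulating independent Gaussian increments; NumPy is assumed and all names are illustrative.

```python
import numpy as np

def wiener_paths(n_paths, n_steps, dt, rng=None):
    """Simulate sample paths of the standard Wiener process W(t).

    Each increment W(t+dt) - W(t) is drawn from N(0, dt), so a path is the
    cumulative sum of independent Gaussian increments starting at W(0) = 0.
    """
    rng = np.random.default_rng() if rng is None else rng
    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    W = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(dW, axis=1)], axis=1)
    return W  # shape (n_paths, n_steps + 1), with W[:, 0] = 0

# Example: 5 paths on [0, 1] with dt = 1e-3
paths = wiener_paths(5, 1000, 1e-3)
```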
The stochastic differential equation (SDE)

$$dX_i(t) = a_i(\mathbf{X},t)\,dt + \sum_{j=1}^{M} b_{ij}(\mathbf{X},t)\,dW_j, \qquad i = 1,\ldots,N,$$

describes a system with N state variables driven by M Wiener processes, where $dW_j$ denotes the increment of the j-th Wiener process. If the coefficients $a_i$ and $b_{ij}$ are linear or constant in $\mathbf{X}$, the SDE is linear; if the $b_{ij}$ are constant, the noise is additive, while any dependence of $b_{ij}$ on $\mathbf{X}$ makes the noise multiplicative. The integral form of the SDE is

$$X_i(t) = X_i(t_0) + \int_{t_0}^{t} a_i(\mathbf{X},t')\,dt' + \int_{t_0}^{t} \sum_{j=1}^{M} b_{ij}(\mathbf{X},t')\,dW_j,$$

where the integral with respect to the Wiener process is a stochastic integral that can be defined in either the Itô or the Stratonovich sense. The classification of the SDE as Itô or Stratonovich depends on the chosen definition, although an Itô SDE can be transformed into an equivalent Stratonovich SDE by modifying the drift coefficients.
Consider a scalar linear Itô SDE driven by a single Wiener process,

$$dX_t = \left(a_1(t)X_t + a_2(t)\right)dt + \left(b_1(t)X_t + b_2(t)\right)dW_t \qquad (2.69)$$
It has the formal solution
$$X_t = \Phi_{t,t_0}\left(X_{t_0} + \int_{t_0}^{t}\left[a_2(s) - b_1(s)b_2(s)\right]\Phi_{s,t_0}^{-1}\,ds + \int_{t_0}^{t} b_2(s)\,\Phi_{s,t_0}^{-1}\,dW_s\right) \qquad (2.70)$$

where $\Phi_{t,t_0}$ is the fundamental solution,

$$\Phi_{t,t_0} = \exp\left(\int_{t_0}^{t}\left[a_1(s) - \tfrac{1}{2}b_1^2(s)\right]ds + \int_{t_0}^{t} b_1(s)\,dW_s\right) \qquad (2.71)$$
The mean $\mu(t)$ and variance $v(t)$ of the solution to Eq. (2.70) satisfy two ordinary differential equations,

$$\frac{d\mu(t)}{dt} = a_1(t)\,\mu(t) + a_2(t) \qquad (2.72)$$

and

$$\frac{dv(t)}{dt} = \left(2a_1(t) + b_1^2(t)\right)v(t) + 2\,b_1(t)\,b_2(t)\,\mu(t) + b_1^2(t)\,\mu^2(t) + b_2^2(t) \qquad (2.73)$$
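As a quick numerical check (an illustrative sketch, not from the original text), the following Python snippet compares Monte Carlo estimates of the mean and variance of the scalar linear SDE (2.69), with arbitrarily chosen constant coefficients, against a forward-Euler integration of the moment ODEs (2.72)-(2.73).

```python
import numpy as np

# Constant coefficients for dX = (a1*X + a2) dt + (b1*X + b2) dW (Eq. 2.69)
a1, a2, b1, b2 = -1.0, 0.5, 0.2, 0.1
x0, T, dt, n_paths = 1.0, 2.0, 1e-3, 20_000
n_steps = int(T / dt)
rng = np.random.default_rng(0)

# Monte Carlo sample via Euler-Maruyama
X = np.full(n_paths, x0)
for _ in range(n_steps):
    dW = rng.normal(0.0, np.sqrt(dt), n_paths)
    X = X + (a1 * X + a2) * dt + (b1 * X + b2) * dW

# Moment ODEs (Eqs. 2.72-2.73), integrated with forward Euler
mu, v = x0, 0.0
for _ in range(n_steps):
    dmu = a1 * mu + a2
    dv = (2 * a1 + b1**2) * v + 2 * b1 * b2 * mu + b1**2 * mu**2 + b2**2
    mu, v = mu + dmu * dt, v + dv * dt

print(X.mean(), mu)   # sample mean vs. ODE mean
print(X.var(), v)     # sample variance vs. ODE variance
```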
If the linear SDE has constant coefficients and the initial condition is normally distributed, the solution is also Gaussian, with mean and variance given by Eqs. (2.72) and (2.73). Otherwise, the solution is generally not Gaussian. For systems of multiple linear SDEs driven by multiple Wiener processes, a spectral decomposition of the drift vector and diffusion tensor can yield a set of decoupled linear SDEs, driven by a modified Wiener process, whose formal solutions retain the same structure.
Explicit Solutions of Some Stochastic Differential Equations
A formal solution to a linear stochastic differential equation (SDE) always exists and can often be evaluated, providing valuable analytic solutions. These solutions are essential for validating stochastic numerical integrators and for assessing how their error scales with the time step. For instance, consider the non-homogeneous scalar linear Itô SDE with additive noise, $dX_t = f(t)\,dt + dW_t$, together with its explicit solution.
A second example is a homogeneous scalar linear Itô SDE with multiplicative noise, which is likewise satisfied by an explicit solution.
To solve a non-linear system of stochastic differential equations (SDEs), one effective approach is to convert it into a linear form, exploiting the existence of formal solutions for linear SDEs. Techniques such as non-linear transformations, or rewriting a single non-linear SDE as a system of coupled linear SDEs, can accomplish this conversion. For instance, the non-linear scalar Itô SDE $dX_t = (aX_t^n + bX_t)\,dt + cX_t\,dW_t$ can be transformed into a linear SDE through the substitution $Y_t = X_t^{1-n}$, enabling the application of the linear solution methods above.
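For completeness, a brief Itô-calculus check (assuming the drift $aX^n + bX$ and diffusion $cX$ used in Eq. (2.79) below) shows why this substitution linearizes the equation:

$$dY_t = (1-n)X_t^{-n}\,dX_t - \tfrac{1}{2}n(1-n)X_t^{-n-1}(cX_t)^2\,dt = (1-n)\left[a + \left(b - \tfrac{1}{2}nc^2\right)Y_t\right]dt + (1-n)\,c\,Y_t\,dW_t,$$

which is linear in $Y_t$, so the formal solution (2.70)-(2.71) applies.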
Transforming back to $X_t = Y_t^{1/(1-n)}$ yields the explicit solution

$$X_t = \Phi_t\left(X_0^{1-n} + a(1-n)\int_0^t \Phi_s^{\,n-1}\,ds\right)^{1/(1-n)} \qquad (2.79)$$

with

$$\Phi_t = \exp\left(\left(b - \tfrac{1}{2}c^2\right)t + c\,W(t)\right) \qquad (2.80)$$
Stochastic differential equations may also describe a complex-valued stochastic process driven by real-valued Wiener processes.
Strong and Weak Solutions
The formal solution of a general stochastic differential equation can be approached in two ways: by generating a trajectory driven by individual Wiener processes, giving a strong solution, or by determining the probability distribution over all possible trajectories, known as a weak solution. For the generalized vector stochastic differential equation, provided the drift vector and diffusion tensor satisfy suitable boundedness conditions and the diffusion tensor is positive definite, the weak solution satisfies the Fokker-Planck equation,
$$\frac{\partial P(\mathbf{X};t)}{\partial t} = -\sum_{i=1}^{N}\frac{\partial}{\partial X_i}\Big[a_i(\mathbf{X},t)\,P(\mathbf{X};t)\Big] + \frac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{N}\frac{\partial^2}{\partial X_i\,\partial X_k}\left[\sum_{j=1}^{M} b_{ij}(\mathbf{X},t)\,b_{kj}(\mathbf{X},t)\,P(\mathbf{X};t)\right] \qquad (2.81)$$

which is a partial differential equation in N dimensions.
Two distinct systems of stochastic differential equations can share the same weak solution while exhibiting different strong solutions. For instance, a two-dimensional system that oscillates in a clockwise direction will yield a different strong solution from one oscillating counter-clockwise, despite both having identical weak solutions. Strong solutions therefore provide more information than weak solutions. By simulating multiple trajectories of each Wiener process and analyzing the ensemble of strong solutions, one can obtain the distribution of that ensemble and thus the weak solution. However, it is generally not possible to generate a strong solution by sampling a probability distribution of a time-dependent system, except in special cases.
The strong solution of a general system of stochastic differential equations (SDEs) often lacks a formal solution, necessitating the use of stochastic numerical integrators to simulate the equation's trajectory. A stochastic numerical method effectively approximates the SDE solution when the generated trajectory converges to the exact SDE trajectory, with the Wiener process paths held fixed, as the time step approaches zero. Specifically, if the time step of the stochastic numerical integrator is $\Delta t = t_{i+1} - t_i$ and the paths of all Wiener processes are fixed at the times $t_i$, $i = 1, \ldots, n$, to be
$\{W_j(t_1), W_j(t_2), \ldots, W_j(t_n)\}$ for $j = 1, \ldots, M$, then the exact trajectory of a d-dimensional stochastic differential equation can be evaluated at the times $t_i$ to be $\{X_k(t_1), X_k(t_2), \ldots, X_k(t_n)\}$ for $k = 1, \ldots, d$.
The numerical approximation of the solution of the SDE that uses the same paths of the Wiener processes is $\{Y_k(t_1), Y_k(t_2), \ldots, Y_k(t_n)\}$ and is considered a strong approximation of the SDE if
A stochastic numerical method that effectively approximates the solution of a stochastic differential equation therefore converges to the exact solution in a path-wise manner. This section concentrates on stochastic numerical methods that produce strong solutions of non-linear stochastic differential equations.
2.4.4 Itô and Stratonovich Stochastic Integrals
The integral with respect to the Wiener process can be interpreted in different ways, the two primary definitions being the Itô and Stratonovich integrals. These distinct definitions carry significant practical implications and are summarized below.
Consider a time interval [0, T] divided into n equal partitions, $0 = t_0 < t_1 < \cdots < t_n = T$, each of width $\Delta t > 0$.
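The practical difference between the two definitions can be seen by evaluating $\int_0^T W\,dW$ with left-endpoint (Itô) and midpoint (Stratonovich) Riemann sums; the exact values are $(W_T^2 - T)/2$ and $W_T^2/2$, respectively. The Python sketch below (illustrative only, NumPy assumed) makes this concrete.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 1.0, 200_000
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), n)
W = np.concatenate([[0.0], np.cumsum(dW)])           # W(t_0), ..., W(t_n)

# Itô sum: integrand evaluated at the left endpoint of each partition
ito = np.sum(W[:-1] * dW)
# Stratonovich sum: integrand evaluated at the midpoint (average of endpoints)
strat = np.sum(0.5 * (W[:-1] + W[1:]) * dW)

print(ito, 0.5 * (W[-1]**2 - T))    # Itô:           (W_T^2 - T)/2
print(strat, 0.5 * W[-1]**2)        # Stratonovich:   W_T^2 / 2
```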
To assess the strong accuracy of a stochastic numerical method, one can numerically solve a stochastic differential equation (SDE) with a known analytical solution. The Wiener processes are fixed over a time interval [0, T] with constant time increments, where the path of a single Wiener process is $\{W(t_1), W(t_2), \ldots, W(t_n)\}$, and the exact solution trajectory on this interval is $\{X(t_1), X(t_2), \ldots, X(t_n)\}$. The numerical approximation is computed with the numerical scheme, using the same time step and Wiener increments, yielding $\{Y(t_1), Y(t_2), \ldots, Y(t_n)\}$. The strong error, $\epsilon_{strong}$, is quantified in absolute terms, but because it is evaluated over a single Wiener path it is itself a random variable, approximately Gaussian distributed. By repeating this procedure over multiple Wiener paths, one calculates the mean of that distribution, $\langle \epsilon_{strong} \rangle$, which obeys the defined scaling of the error with the time step. As the number of evaluated trajectories approaches infinity, the variance of the estimated mean converges to zero.
To assess the weak accuracy of a numerical scheme, one compares the expected value of the exact solution of the stochastic differential equation with the expected value of the approximate numerical solution, estimated as $E[Y(t_i)] \approx \frac{1}{N}\sum_{k=1}^{N} Y_k(t_i)$ over N distinct Wiener paths, as outlined in Eq. (2.115).
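The following Python sketch (illustrative, not the thesis code) estimates both error measures for geometric Brownian motion, $dX = aX\,dt + bX\,dW$, whose exact solution $X_T = X_0\exp((a - b^2/2)T + bW_T)$ is evaluated along the same Wiener paths used by a simple Euler-Maruyama approximation (described in the next subsection); halving the time step should roughly shrink the strong error by $\sqrt{2}$ and the weak error by 2.

```python
import numpy as np

a, b, x0, T = 1.5, 1.0, 1.0, 1.0
n_paths = 5_000
rng = np.random.default_rng(2)

def errors(n_steps):
    """Mean strong error and weak (mean) error at time T for dX = aX dt + bX dW."""
    dt = T / n_steps
    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    W_T = dW.sum(axis=1)
    x_exact = x0 * np.exp((a - 0.5 * b**2) * T + b * W_T)   # exact solution at T
    y = np.full(n_paths, x0)
    for i in range(n_steps):                                # Euler-Maruyama, same paths
        y = y + a * y * dt + b * y * dW[:, i]
    return np.abs(x_exact - y).mean(), abs(x_exact.mean() - y.mean())

for n in (2**6, 2**7, 2**8, 2**9):
    print(n, *errors(n))   # strong error ~ O(dt^0.5), weak error ~ O(dt)
```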
The numerical stability of stochastic numerical integrators is crucial, as instability can amplify initial and roundoff errors and significantly degrade the accuracy of the solution. This issue is particularly pronounced for stiff stochastic differential equations, which are characterized by a wide separation of time scales. An N-dimensional linear stochastic differential equation (SDE) driven by M Wiener processes is deemed stiff if there is a substantial disparity between the minimum and maximum Lyapunov exponents of the drift and diffusion matrices, A and B, respectively. Unlike ordinary differential equations, stiffness may arise from time-scale separation in either the drift or the diffusion components. To assess asymptotic stochastic stability, one inserts a "test equation" with specified Lyapunov exponents into the numerical scheme and analyzes the error generated in successive iterations. Asymptotic stochastic stability is confirmed if the error in the (n+1)-th iteration does not grow with n.
The function $g$ in

$$Y^{(n+1)} = g\!\left(\lambda_a, \lambda_b, \Delta t\right)\,Y^{(n)} \qquad (2.117)$$

depends on the chosen numerical scheme and on the test equation. For Itô-Taylor explicit numerical schemes, asymptotic stochastic stability is guaranteed if the real part of the leading Lyapunov exponent satisfies $\mathrm{Re}(\lambda_a)\,\Delta t < 1$.
The Euler-Maruyama method is an explicit stochastic numerical scheme obtained by truncating the Itô-Taylor expansion of the solution $X_t$ about the previous iterate after its lowest-order terms, providing a straightforward approach for solving stochastic differential equations. The single stochastic integral retained in the Euler-Maruyama method has strong order $O(\sqrt{\Delta t})$ and weak order $O(\Delta t)$. Therefore, the strong order of accuracy is $\gamma = 0.5$ while the weak order of accuracy is 1.0.
When the Euler-Maruyama scheme is applied to a generalized N-dimensional stochastic differential equation driven by M Wiener processes, the k-th component of Y is advanced according to

$$Y_k^{(n+1)} = Y_k^{(n)} + a_k\,\Delta t + \sum_{j=1}^{M} b_{kj}\,\Delta W_j,$$

where each $\Delta W_j \sim N(0, \Delta t)$ for $j = 1, \ldots, M$ and $a_k$ denotes the drift coefficient of the k-th equation.
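A minimal vectorized implementation of this scheme might look as follows (a sketch assuming NumPy; the drift and diffusion callables and the test SDE are illustrative choices, not taken from the text).

```python
import numpy as np

def euler_maruyama(a, b, x0, t0, t_end, dt, rng=None):
    """Explicit Euler-Maruyama scheme for dX = a(X,t) dt + b(X,t) dW.

    a(x, t) -> drift vector of length N
    b(x, t) -> diffusion matrix of shape (N, M), one column per Wiener process
    """
    rng = np.random.default_rng() if rng is None else rng
    n_steps = int(round((t_end - t0) / dt))
    x = np.array(x0, dtype=float)
    traj = np.empty((n_steps + 1, x.size))
    traj[0] = x
    t = t0
    for n in range(n_steps):
        B = np.atleast_2d(b(x, t))
        dW = rng.normal(0.0, np.sqrt(dt), B.shape[1])   # dW_j ~ N(0, dt)
        x = x + a(x, t) * dt + B @ dW
        t += dt
        traj[n + 1] = x
    return traj

# Example: the scalar linear test SDE dX = -X dt + 0.1 X dW
traj = euler_maruyama(lambda x, t: -x, lambda x, t: 0.1 * x, [1.0], 0.0, 1.0, 2**-7)
```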
Figure 2.9: Using the Euler-Maruyama scheme, the numerical solution of the simple linear stochastic differential equation in Eq. (2.119) is calculated at two different time steps (red and yellow) and compared with the exact solution.
Implicit Stochastic Numerical Schemes
A stochastic numerical method is implicit if the calculation of $Y^{(n+1)}$ involves evaluating the drift or diffusion coefficients at $Y^{(n+1)}$ itself.
Semi-implicit methods evaluate only the drift coefficients at the unknown iterate $Y^{(n+1)}$ and are asymptotically stable for any time step. This property makes them particularly useful for stiff stochastic differential equations, where semi-implicit stochastic methods are frequently employed to improve stability and reliability.
For example, for a generalized N-dimensional SDE with M Wiener processes, the θ-family of semi-implicit Euler-Maruyama schemes is

$$Y_k^{(n+1)} = Y_k^{(n)} + \left\{\theta\,a_k\!\left(Y^{(n+1)}, t_{n+1}\right) + (1-\theta)\,a_k\!\left(Y^{(n)}, t_n\right)\right\}\Delta t + \sum_{j=1}^{M} b_{kj}\!\left(Y^{(n)}, t_n\right)\Delta W_j \qquad (2.125)$$

for k = 1, ..., N, where θ is a parameter in the range [0, 1]. When θ = 0 the scheme reduces to the explicit Euler-Maruyama method; when θ = 1 it is the fully semi-implicit Euler-Maruyama scheme; and θ = 0.5 gives a trapezoidal-style scheme for stochastic differential equations.
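A sketch of one θ-family step for a scalar SDE is shown below (illustrative Python, not the thesis implementation); the implicit drift relation is solved with a few Newton iterations using a finite-difference derivative. With θ = 1 the step remains bounded on a stiff linear test equation at a time step for which the explicit scheme (θ = 0) would blow up.

```python
import numpy as np

def theta_em_step(a, b, y, t, dt, dW, theta=1.0, newton_iters=8, h=1e-7):
    """One step of the drift-implicit (theta) Euler-Maruyama scheme (scalar case).

    Solves F(z) = z - y - [theta*a(z,t+dt) + (1-theta)*a(y,t)]*dt - b(y,t)*dW = 0
    for z = y_{n+1} with a few Newton iterations (finite-difference derivative).
    """
    const = y + (1.0 - theta) * a(y, t) * dt + b(y, t) * dW
    z = y                                    # initial guess
    for _ in range(newton_iters):
        F = z - const - theta * a(z, t + dt) * dt
        dF = 1.0 - theta * dt * (a(z + h, t + dt) - a(z, t + dt)) / h
        z = z - F / dF
    return z

# Stiff linear test: dX = -50 X dt + 0.5 X dW with dt = 0.1.
# The explicit scheme (theta = 0) is unstable here; theta = 1 stays bounded.
rng = np.random.default_rng(3)
drift = lambda x, s: -50.0 * x
diff = lambda x, s: 0.5 * x
y, t, dt = 1.0, 0.0, 0.1
for _ in range(50):
    y = theta_em_step(drift, diff, y, t, dt, rng.normal(0.0, np.sqrt(dt)), theta=1.0)
    t += dt
print(y)
```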
Similar semi-implicit schemes are available for the Milstein and higher-order methods, where the schemes are implicit solely in the drift coefficients. Fully implicit schemes treat both the drift and diffusion terms implicitly, but they lead to expressions containing Wiener increments in the denominator, such as $1/\Delta W_j$, whose moments are unbounded.
A family of balanced implicit methods has been introduced to address this limitation of naive fully implicit schemes, which can fail to converge to the correct solution as the time step decreases. Balanced methods retain an implicit treatment of both the drift and diffusion coefficients while eliminating the problematic Wiener increments from the denominator.
Adaptive Time Step Schemes
Traditional stochastic numerical methods use a constant time step, but adjusting the time step to the stiffness of the stochastic differential equation (SDE) can be more efficient. Modern deterministic integrators typically use heuristic measurements to determine an "optimal" time step, but this approach is less practical for SDEs because those measurements are themselves random variables and may not accurately predict the error of the next iteration. Instead, one can assess the numerical error after each iteration; if the error is too large, the numerical scheme is reapplied with a smaller time step. However, to obtain a strong solution of the SDE, the previously generated Wiener paths must be reused at the intermediate time points, which requires a Brownian bridge to generate those intermediate points.
Research on adaptive time stepping schemes for stochastic differential equations is in its infancy. A pioneering study in 1997 introduced a binary Brownian bridge for an adaptive time stepping scheme based on the Milstein method, demonstrating that strong convergence to the exact solution requires a stochastic numerical integrator with a strong order of accuracy of at least one. More recent work has concentrated on new heuristic measures of the numerical error. A non-binary Brownian bridge can also be used for adaptive time stepping, but its implementation for multi-dimensional stochastic differential equations poses significant challenges.
The Wiener process exhibits fractal characteristics: a sample path remains continuous at any level of magnification, and any zoomed-in portion is statistically a rescaled version of the original process. A Brownian bridge is a mathematical device for constructing sample paths with progressively smaller time increments while keeping the endpoints fixed. This makes it possible to calculate Wiener increments at intermediate times while remaining on the same sample path.
A binary Brownian bridge generates a Brownian tree by repeatedly halving the time increments of the sample path. The tree consists of rows indexed by r = 0, 1, ..., R, where the r-th row contains $2^r$ elements. The tree starts with a single row, the top row, which contains the largest Wiener increment, $\Delta W_1^0 = W(t_{n+1}) - W(t_n)$, determined by the initial time step $\Delta t = t_{n+1} - t_n$. The second row contains two smaller Wiener increments, $\Delta W_1^1 = W(t_n + \tfrac{1}{2}\Delta t) - W(t_n)$ and $\Delta W_2^1 = W(t_{n+1}) - W(t_n + \tfrac{1}{2}\Delta t)$, for which the time step has been halved. In general, the r-th row contains $2^r$ Wiener increments, each with a time step of $\Delta t / 2^r$. To generate the additional Wiener increments, the relationships
$$\Delta W_{2p-1}^{\,r+1} = \tfrac{1}{2}\Delta W_p^{\,r} + Z_p^{\,r+1} \qquad (2.126)$$

and

$$\Delta W_{2p}^{\,r+1} = \tfrac{1}{2}\Delta W_p^{\,r} - Z_p^{\,r+1} \qquad (2.127)$$

are used to create the odd and even Wiener increments of the (r+1)-th row in terms of the r-th row, with

$$Z_p^{\,r} \sim N\!\left(0,\; 2^{-(r+1)}\Delta t\right) \qquad (2.128)$$

for rows r = 1, ..., R and elements $p = 1, 2, \ldots, 2^{r-1}$. Using the Brownian tree, Wiener increments can be generated for a Wiener sample path whenever the time step is repeatedly halved.
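The refinement rule of Eqs. (2.126)-(2.128) is simple to implement; the sketch below (illustrative Python, NumPy assumed) generates successive rows of a binary Brownian tree from a single top-level increment and checks that every row sums to the same total increment.

```python
import numpy as np

def refine_brownian_row(row_r, r, dt, rng=None):
    """Generate row r+1 of a binary Brownian tree from row r (Eqs. 2.126-2.128).

    row_r : array of the 2**r Wiener increments of row r, each over a step dt/2**r
    Returns the 2**(r+1) increments of row r+1, each over a step dt/2**(r+1).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Z_p^{r+1} ~ N(0, dt / 2**(r+2)), one per parent increment
    Z = rng.normal(0.0, np.sqrt(dt / 2**(r + 2)), size=row_r.size)
    child = np.empty(2 * row_r.size)
    child[0::2] = 0.5 * row_r + Z     # odd-indexed increments (2p - 1)
    child[1::2] = 0.5 * row_r - Z     # even-indexed increments (2p)
    return child

# Build a small Brownian tree for one top-level increment over dt = 0.5
rng = np.random.default_rng(4)
dt = 0.5
tree = [np.array([rng.normal(0.0, np.sqrt(dt))])]   # row 0: the full increment
for r in range(3):
    tree.append(refine_brownian_row(tree[r], r, dt, rng))

# Consistency check: every row sums to the same total increment
assert all(np.isclose(row.sum(), tree[0][0]) for row in tree)
```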
A Typical Adaptive Time Step Scheme
A typical adaptive time stepping scheme that uses a Brownian tree consists of the following steps:
1. The approximate solution, $Y^{(n+1)}$, is calculated using a stochastic numerical scheme with a strong order of accuracy $\gamma \geq 1$ and a time step $\Delta t$.
2. Using $Y^{(n+1)}$, a measurement of the numerical error is calculated via a heuristic expression.
3. If the numerical error is greater than some tolerance, the time step is halved, $\Delta t \rightarrow \Delta t/2$. If the corresponding row of the Brownian tree has not already been generated, then Eqs. (2.126) and (2.127) are used to generate it.
4. The Wiener increments from the Brownian tree are then fed into the stochastic numerical integrator with the reduced time step to produce a more accurate solution $Y^{(n+1)}$.
5. The time step selection loop repeats by going to step 2 until the generated numerical error is less than some tolerance value.
6. The procedure is repeated until a specified end time is reached: $n \rightarrow n + 1$.
The selection of a heuristic method for measuring the numerical error of a stochastic differential equation (SDE) is subjective and can involve various functions that assess the equation's stiffness. Generally, distinct error measurements are used for the drift and diffusion components of the SDE, because either the drift or the diffusion term may independently contribute to the overall stiffness of the equation.
HyJCMSS: The Hybrid Jump/Continuous Markov Stochastic Simulator
Introduction
Stochastic chemical kinetics has gained significant attention because of the inherently stochastic nature of biological systems. By modeling chemically reacting systems as jump Markov processes, researchers can capture the fluctuations caused by discrete reactions among dilute chemical species. The foundational work of Gillespie introduced the stochastic simulation algorithm (SSA), which has two main variants: Direct and First Reaction. Building on this, Gibson and Bruck developed the Next Reaction variant using specialized data structures and efficient random number generation. More recently, Cao and colleagues demonstrated that the Direct method of the SSA can outperform the Next Reaction variant for specific systems. Although the SSA provides an exact realization of the chemical dynamics, it becomes impractical for systems with a high frequency of reaction events, because the computational demand scales with the number of reaction occurrences, particularly for fast reversible or enzymatic processes.
To tackle these challenges, several approximations have been developed. Rao and Arkin apply the quasi-steady state approximation to intermediate molecular species in the Master equation, with notable success in numerically simulating a reduced reaction system. Gillespie introduces the tau-leap and k_α-leap methods, which treat fast reactions as Poisson distributed events, allowing the simulation to jump over many rapid reaction occurrences. Furthermore, by approximating the reaction events as continuous, Gillespie reformulates the system as a continuous Markov process and derives the chemical Langevin equation (CLE), a form of multiplicative Itô stochastic differential equation.
The system can therefore be represented as either discrete-stochastic (SSA) or continuous-stochastic (CLE) through mathematical approximations. However, if these approximations are invalid for some subset of the reactions, numerical accuracy may be compromised. To retain accuracy while reducing computational cost, a hybrid method can be employed that uses distinct mathematical descriptions for different subsets of the system and merges them self-consistently.
Haseltine and Rawlings introduced a stochastic hybrid method that partitions a system into fast and slow reactions, modeled as continuous and jump Markov processes, respectively. They present two variants of the Direct method for hybrid simulation: one that exactly calculates the slow reaction waiting times and another that approximates them. Their approach numerically integrates the chemical Langevin equation while determining the slow reaction waiting times through a constraint based on the time-dependent probability density of the Direct variant of the stochastic simulation algorithm. To improve accuracy, they include a 'no slow reaction' propensity in their implementation. Puchalka and Kierzek propose the 'maximal time step' method, which also divides the system into fast and slow reactions; it simulates the slow reactions with the time-independent Next Reaction variant and approximates the fast reactions with a Poisson distribution, taking the simulation time step as the lesser of the next reaction time and a user-defined maximal time step.
This chapter introduces a hybrid method that accurately approximates the solution of well-mixed chemical or biochemical reaction systems whose kinetics operate in the stochastic regime. Building on prior work, we partition the reaction system into "fast/continuous" and "slow/discrete" subsystems and approximate the fast/continuous reactions as a continuous Markov process, describing the fast dynamics with the chemical Langevin equation. We introduce a novel approach that uses the time-dependent probability density of the Next Reaction variant of the stochastic simulation algorithm to calculate the waiting times of the slow reactions, constrained by a set of integral algebraic equations that we call 'jump equations'. By recasting the jump equations as a system of residuals, the reaction time of each slow reaction is computed by monitoring the zero crossings of the residuals in time.
We also propose a numerical approximation that allows multiple slow reactions to be executed within a single integration step of the chemical Langevin equation, improving computational efficiency with minimal loss of accuracy. While event times can be sampled from arbitrary time-dependent probability densities, we focus on the exponential distribution typical of stochastic chemical kinetics. The result is a system of stochastic differential and jump equations describing the fast and slow reactions, solvable with a variety of stochastic numerical integrators; for simplicity we use the Euler-Maruyama method. We anticipate that this numerical technique will be useful for simulating chemical and physical systems in the stochastic kinetic regime that contain both fast and slow reactions, such as biological systems in which rapid enzymatic reactions modulate regulatory proteins that bind to DNA to trigger critical events. Whole-cell simulators will require a stochastic hybrid method to accurately model the vast number of reactions, both fast and slow, occurring among reactant species present at widely varying concentrations.
Our hybrid method integrates representations of a coupled jump and continuous Markov process, allowing transitions of the jump Markov process to be defined by any computable distribution, including the Direct or Next Reaction exponential distributions. Importantly, our approach is not restricted to a specific probability distribution, as the 'Direct', 'First Reaction', and 'Next Reaction' distributions are mathematically equivalent. While the Direct distribution may offer computational advantages, as highlighted by Cao and colleagues, we chose the Next Reaction distribution because of properties beneficial to our application. Specifically, we use gamma-distributed reactions to efficiently simulate the transcriptional and translational elongation steps of gene expression, which is more cumbersome to implement with the Direct method.
This chapter is structured into three key sections. It begins with the fundamental theory behind system partitioning, the approximation of the fast reactions, and the formulation of the jump equations. It then presents the algorithms for the proposed hybrid stochastic methods, with and without the multiple slow reaction approximation. Finally, the accuracy of these methods is validated on several examples, showcasing their effectiveness on dynamic biological systems and comparing them to Haseltine and Rawlings's Direct hybrid stochastic method using a simple crystallization model and large-scale benchmark models.
Theory
Our system is a well-mixed volume, V, containing N unique chemical species participating in M reactions. The state of the system, the N-vector X, gives the number of molecules of each chemical species. The stoichiometric matrix, ν, an M×N matrix, describes the reactants and products of the reactions. The reaction propensities, a (an M-vector), are the probabilistic rates of the reactions, where $a_j\,dt$ is the probability that the j-th reaction occurs in the next infinitesimal time dt. The propensities may be computed from various rate laws, including mass action and Michaelis-Menten kinetics. Finally, Δt denotes the time increment used to numerically integrate the reactions described by the chemical Langevin equation.
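In code, this amounts to little more than a state vector, an M×N stoichiometric matrix, and a propensity function; the small Python sketch below (illustrative names and rate constants, not from the text) sets these up for a toy two-reaction system with mass action kinetics.

```python
import numpy as np

# Toy system: species [A, B, C]; reactions R1: A + B -> C, R2: C -> A + B
X = np.array([5000, 4000, 3000])              # state vector (molecule counts)
nu = np.array([[-1, -1, +1],                  # stoichiometric matrix, M x N:
               [+1, +1, -1]])                 # row j = net change of reaction j
k = np.array([0.001, 0.002])                  # stochastic rate constants

def propensities(X):
    """Mass-action propensities a_j(X); a_j*dt is the probability that
    reaction j fires in the next infinitesimal interval dt."""
    return np.array([k[0] * X[0] * X[1],      # bimolecular A + B
                     k[1] * X[2]])            # unimolecular C

a = propensities(X)
```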
The system is dynamically partitioned into two subsets, containing all of the "fast" or "slow" reactions, respectively. The entire system is described as a jump Markov process governed by a chemical Master equation. A reaction is classified as "fast" if it can be accurately approximated as a continuous Markov process, which is valid under two conditions: the reaction occurs many times in a small time interval, and the effect of each occurrence on the numbers of reactant and product molecules is small compared with the total numbers of those species.
One can quantify these conditions so that, for the j-th reaction to be classified as "fast", the following must be true:

$$a_j(t) > \lambda > 1 \qquad (2.129)$$

and

$$X_i(t) > \varepsilon\,\lvert \nu_{ji} \rvert \quad \text{for } i \in \{\text{reactants and products of the } j\text{-th reaction}\} \qquad (2.130)$$
The parameters λ and ε determine, respectively, how many reaction occurrences within a dt and how many molecules of the reactant and product species are required for a continuous representation. As λ and ε approach infinity, we recover the thermodynamic limit of the chemical Master equation and the approximation becomes exact. In practice, the values of these parameters remain finite.
In the simulations below, the parameters ε and λ are set to 100 and 10, respectively. Because the reaction propensities and the state vector change in time, these conditions must be re-evaluated repeatedly, and each reaction reclassified as "fast" or "slow".
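A direct transcription of the partitioning test of Eqs. (2.129)-(2.130) might look as follows (an illustrative Python sketch continuing the toy system above; the default λ and ε match the values quoted in the text).

```python
import numpy as np

def partition_reactions(a, X, nu, lam=10.0, eps=100.0):
    """Classify each reaction as fast/continuous or slow/discrete.

    Reaction j is 'fast' if a_j > lam (Eq. 2.129) and every reactant/product i
    satisfies X_i > eps * |nu_ji| (Eq. 2.130); otherwise it is 'slow'.
    """
    fast, slow = [], []
    for j in range(nu.shape[0]):
        species = np.nonzero(nu[j])[0]                 # reactants/products of j
        big_counts = np.all(X[species] > eps * np.abs(nu[j, species]))
        (fast if (a[j] > lam and big_counts) else slow).append(j)
    return fast, slow

# Example with the toy system defined above: R1 is fast, R2 is slow
fast, slow = partition_reactions(propensities(X), X, nu)
```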
The partitioning of the chemical Master equation into subsets has been described previously in the literature. Here we outline the procedure and highlight the mathematical consequences of restricting the simulation to executing one, or more than one, slow reaction at a time.
The chemical Master equation of the system governs the joint probability density of the numbers of occurrences of the fast and slow reactions, $r^f$ and $r^s$. This joint probability density can be split into the conditional probability that the slow reactions have occurred, $P(r^s \mid r^f; t)$, and the marginal probability of the fast reaction occurrences, $P(r^f; t)$:

$$P(r^s, r^f; t) = P(r^s \mid r^f; t)\,P(r^f; t) \qquad (2.131)$$
Differentiating Eq. (2.131) with respect to time yields

$$\frac{dP(r^s, r^f; t)}{dt} = P(r^f; t)\,\frac{dP(r^s \mid r^f; t)}{dt} + \frac{dP(r^f; t)}{dt}\,P(r^s \mid r^f; t) \qquad (2.132)$$
When we restrict the simulation to performing one slow reaction at a time, the conditional probability of that slow reaction, given a fixed number of fast reaction occurrences, remains constant throughout its waiting period. Consequently, its time derivative vanishes,

$$\frac{dP(r^s \mid r^f; t)}{dt} = 0 \qquad (2.133)$$
By substituting Eq (2.133) into the chemical Master equation, we derive the governing equation that describes how the probability density of fast reaction occurrences evolves over time, influenced by the constraints of slow reactions.
When multiple slow reactions occur within a brief time interval, updating the conditional probability $P(r^s \mid r^f; t)$ after each slow reaction while neglecting its time derivative between occurrences yields an approximation of Eq. (2.133). For small time increments, the influence of the slow reactions on the probabilities of the fast reaction occurrences remains minimal. The examples section evaluates the accuracy of this approximation for different time increment sizes.
Deriving the Chemical Langevin Equation
We model the fast reactions as a continuous Markov process and use the chemical Langevin equation to simulate the stochastic dynamics of the system due to the fast reactions alone. The chemical Langevin equation is derived by approximating the numbers of fast reaction occurrences as Gaussian distributed, as shown previously. Alternatively, it can be obtained by rewriting the chemical Master equation in intensive variables, taking the thermodynamic limit, and retaining only the first two moments of the resulting probability distribution. We assess the validity of this approximation dynamically through Eqs. (2.129) and (2.130). The result is a system of Itô stochastic differential equations with multiple multiplicative noises.
For the fast reaction subset, the chemical Langevin equation is

$$dX_i = \sum_{j=1}^{M_{\mathrm{fast}}} \nu_{ji}\,a_j^f(\mathbf{X}(t))\,dt + \sum_{j=1}^{M_{\mathrm{fast}}} \nu_{ji}\sqrt{a_j^f(\mathbf{X}(t))}\,dW_j \qquad (2.134)$$

where $a^f$ are the fast reaction propensities, ν is the stoichiometric matrix restricted to the fast reactions, and the Wiener processes $W_j$ introduce the Gaussian noise. While the chemical Master equation provides the probability density of the system at a given time, the chemical Langevin equation produces one possible trajectory of the chemical species over time. Stochastic differential equations can be numerically integrated with a variety of methods that differ significantly in accuracy and cost, including the Euler-Maruyama method as well as derivative-free stochastic Runge-Kutta and adaptive time step methods. Although we use the simple Euler-Maruyama method here, future applications will likely benefit from stochastic numerical integrators that adapt the time increment to improve accuracy.
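One Euler-Maruyama step of Eq. (2.134), restricted to the fast/continuous subset, can be sketched as follows (illustrative Python; it reuses the toy definitions from the earlier sketches and is not the Fortran95 implementation described later).

```python
import numpy as np

def cle_step(X, nu, propensities, fast, dt, rng):
    """One Euler-Maruyama step of the chemical Langevin equation (Eq. 2.134),
    restricted to the fast/continuous reactions; returns the increment dX_fast."""
    a = propensities(X)
    dX = np.zeros(len(X))
    for j in fast:
        dW = rng.normal(0.0, np.sqrt(dt))              # dW_j ~ N(0, dt)
        dX += nu[j] * (a[j] * dt) + nu[j] * np.sqrt(a[j]) * dW
    return dX

rng = np.random.default_rng(5)
dX_fast = cle_step(X.astype(float), nu, propensities, fast, dt=1e-3, rng=rng)
```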
Deriving the Differential Jump Equations
Having addressed the simulation of the fast reaction subset, we now simulate the slow reaction occurrences by constructing a system of integral algebraic constraints, the "jump equations". They are called jump equations because their solutions determine when the system jumps from one state to another. We introduce an efficient method for determining the times of the slow reactions that captures the time-dependent nature of the slow reaction propensities while remaining computationally efficient.
To determine the times of the slow reactions, we use a Monte Carlo technique that equates the integral of a time-dependent probability density to a uniform random number. While a jump equation can be derived from any probability density, not every density yields a tractable solution. For the j-th slow reaction, we define a general probability density as follows.
$P_j(\tau_j \mid a_j^s(t), \mathbf{X}(t_0), t_0)\,d\tau$ is the probability that the j-th slow reaction occurs in the infinitesimal interval $[t_0 + \tau_j,\, t_0 + \tau_j + d\tau)$, given the probabilistic rate $a_j^s(t)$ and the history of the state $\mathbf{X}$ since the reaction's last occurrence at time $t_0$.

The corresponding jump equation, governing the reaction times, is then

$$\int_{t_0}^{t_0 + \tau_j} P_j\!\left(t' \mid a_j^s, \mathbf{X}(t_0), t_0\right)dt' - \mathrm{URN}_j = 0, \qquad (2.135)$$

where $\tau_j$ is the reaction time of the j-th slow reaction, measured from its last occurrence at $t_0$, and $\mathrm{URN}_j$ is a uniform random number in (0,1). Rearranging this equation in terms of a residual and replacing the upper limit of the integral with the current time gives

$$R_j(t) = \int_{t_0}^{t} P_j\!\left(t' \mid a_j^s, \mathbf{X}(t_0), t_0\right)dt' - \mathrm{URN}_j, \qquad (2.136)$$

with the initial condition $R_j(t_0) = -\mathrm{URN}_j$. Because the probability density $P_j(t)$ is non-negative, its integral is monotonically non-decreasing over any small time increment Δt, so that
$$\int_{t_0}^{t_0 + (k+1)\Delta t} P_j(t')\,dt' \;\geq\; \int_{t_0}^{t_0 + k\Delta t} P_j(t')\,dt' \qquad \forall\, k \geq 0 \qquad (2.137)$$

and one can determine whether the j-th slow reaction has occurred by simply monitoring the sign of the residual, according to

$$R_j = 0 \;\Longrightarrow\; t = \tau_j,$$

where t is the current simulation time.
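For the exponential waiting-time density used in the algorithm below, the residuals are kept in the equivalent log-transformed form $R_j = \log(\mathrm{URN}_j) + \int a_j^s\,dt$ (as in the Initialization subroutine), which reaches zero at the same reaction time. The bookkeeping then reduces to the following illustrative Python sketch: each residual starts at $\log(\mathrm{URN}_j) < 0$, grows by $a_j^s\,\Delta t$ per step, and a zero crossing marks the firing of the corresponding slow reaction.

```python
import numpy as np

rng = np.random.default_rng(6)

def init_residuals(n_slow):
    """R_j(t0) = log(URN_j) < 0, the log-transformed residual used in the algorithm."""
    return np.log(rng.uniform(size=n_slow))

def advance_residuals(R, a_slow, dt):
    """Integrate the differential jump equations over a step dt:
    R_j <- R_j + a_j^s(X) * dt.  A zero crossing signals that reaction j fires."""
    R_new = R + a_slow * dt
    fired = np.nonzero((R < 0.0) & (R_new >= 0.0))[0]
    return R_new, fired

R = init_residuals(2)
R, fired = advance_residuals(R, a_slow=np.array([0.3, 0.05]), dt=0.1)
```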
Algorithms
The HyJCMSS algorithm is implemented in Fortran95, as is Haseltine and Rawlings's algorithm, which features a scaled stochastic time step and a 'no slow reaction' propensity. Both implementations use the same data structures, including an indexed priority queue and a dependency graph, together with the Next Reaction variant of the stochastic simulation algorithm.
The HyJCMSS method without the multiple slow reaction approximation is the foundational "exact" version and is outlined in detail below. Selected steps are also illustrated with short code sketches; for clarity, we restrict attention to fixed step stochastic numerical integrators.
For compactness, X(t) denotes the state vector at time t, a(t) the vector of reaction propensities at time t, and a(X) the reaction propensities evaluated at state X. The initial numbers of chemical species are denoted $X_o$. The abbreviations stochastic simulation algorithm (SSA), stochastic differential equation (SDE), chemical Langevin equation (CLE), and uniform random number (URN) are used. The HyJCMSS algorithm consists of an initialization subroutine followed by a propagator loop, described below.
The Initialization Subroutine: The initial condition $X_o$, the initial time $t_o$, and the reaction network's kinetics, rate laws, and stoichiometries are given.
1. Using the reaction network details, the rates a(X) are calculated.
2. The reaction network is partitioned into fast/continuous and slow/discrete subnetworks using the λ and ε parameters according to Eqs. (2.129) and (2.130). The rates of the fast/continuous and slow/discrete reactions are classified as $a^f(X)$ and $a^s(X)$, respectively. The lists of all slow/discrete and fast/continuous reactions are defined as $L^{SD}$ and $L^{FC}$.
3. The reaction residuals of all slow/discrete reactions are reset: $R_j = \log(\mathrm{URN}_j)$ for all $j \in L^{SD}$.
4. The times of the next slow/discrete reactions are calculated using $\tau_j = -R_j / a_j(X) + t$ for all $j \in L^{SD}$ with $a_j(X) > 0$. For all $a_j = 0$, $\tau_j = \infty$.
5. The minimum reaction time, $\tau_\mu = \min_j(\tau_j)$, is determined via an indexed priority queue.
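A compact sketch of this initialization in Python (using heapq as a stand-in for the indexed priority queue, and assuming the illustrative `propensities` and `partition_reactions` helpers from the earlier sketches are in scope) is shown below.

```python
import heapq
import numpy as np

def initialize(X0, t0, propensities, nu, lam=10.0, eps=100.0, rng=None):
    """Steps 1-5 of the HyJCMSS initialization: rates, partition, residuals,
    slow-reaction times, and a priority queue keyed by reaction time."""
    rng = np.random.default_rng() if rng is None else rng
    a = propensities(X0)                                    # step 1: rates
    fast, slow = partition_reactions(a, X0, nu, lam, eps)   # step 2: partition
    R = {j: np.log(rng.uniform()) for j in slow}            # step 3: residuals
    tau = {j: (-R[j] / a[j] + t0 if a[j] > 0 else np.inf)   # step 4: reaction times
           for j in slow}
    heap = [(tau[j], j) for j in slow]                      # step 5: indexed queue
    heapq.heapify(heap)
    return a, fast, slow, R, tau, heap

a, fast, slow, R, tau, heap = initialize(X, 0.0, propensities, nu)
```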
The Propagator Subroutine: Given the current state of the system, X(t), and a specified end time, $t_{stop}$, return the future state, $X(t_{stop})$. It is assumed that the Initialization subroutine has been executed and that the kinetics, rate laws, and stoichiometries of the reaction network, together with a dependency graph of the reaction network, are available. The following loop is repeated until the end time is reached.
1. Treat any previous fast/continuous reactions that are now classified as slow/discrete: recompute the corresponding reaction times using $\tau_j = -R_j / a_j(X) + t$ and resort the indexed priority queue. Do not reset the reaction residuals of reclassified reactions.
2. Treat any previous slow/discrete reactions that are now classified as fast/continuous: set the corresponding reaction times to machine infinity and resort the indexed priority queue.
3. If there are any fast/continuous reactions in the network, then continue. Otherwise, execute the stochastic simulation algorithm to calculate $dX_{slow}$ and continue to step 11.
4. Select an initial $\Delta t_{SDE}$. If $t_{stop} - t < \Delta t_{SDE}$, then choose $\Delta t_{SDE} = t_{stop} - t$.
5. Execute all slow/discrete reactions occurring during the interval $(t, t + \Delta t_{SDE}]$, unless a special event occurs first. Initialize: $dX_{slow} = 0$, counter = 0, $t_{last} = t$, and SpecialEvent = FALSE.
6. Perform the following loop while $\tau_\mu \leq t + \Delta t_{SDE}$ and no special event has occurred:
a) Select the minimum reaction time, $\tau_\mu$, from the root node of the indexed priority queue.
b) If $(dX_{i,slow} + \nu_{\mu i})/X_i > MSR_{tol}$ for any i-th species affected by a fast/continuous reaction and counter > 1, then set SpecialEvent = TRUE, set $\Delta t_{SDE} = t_{last} - t$, and go to step 6. Otherwise, continue.
c) Execute the μ-th slow/discrete reaction: $dX_{slow} = dX_{slow} + \nu_\mu$, and increment counter.
d) Numerically integrate the differential jump equations on the interval $(t_{last}, \tau_\mu)$ to update the reaction residuals of the slow/discrete reactions: $R_j = R_j + a_j^s(X)\,(\tau_\mu - t_{last})$ for $j \neq \mu$.
e) Set $t_{last} = \tau_\mu$. Using the dependency graph, update the rates of the slow/discrete reactions affected by the executed reaction, and calculate the updated reaction times using $\tau_j = -R_j / a_j(X) + t_{last}$ for $a_j(X) > 0$. Resort the updated reaction times in the indexed priority queue and repeat from step 6.
7. Update the reaction residuals for the "leftover" interval $(t_{last}, t + \Delta t_{SDE})$ using $R_j = R_j + a_j^s(X)\,(t + \Delta t_{SDE} - t_{last})$.
8. Numerically integrate the CLE in Eq. (2.134) over the interval $(t, t + \Delta t_{SDE})$ using a stochastic numerical integrator. Calculate $dX_{fast}$.
10. Update the state of the system: $X(t + \Delta t_{SDE}) = X(t) + dX_{fast} + dX_{slow}$.
11. Partition the system of reactions into fast/continuous and slow/discrete reactions again.
When there are no fast/continuous reactions, the stochastic simulation algorithm (SSA) is used, converted to the form of the differential jump equations. This conversion retains the original reaction propensities and allows reactions to be reclassified freely without losing any partially accumulated reaction residual. When switching to the SSA subroutine, the values of the reaction residuals must be preserved so that the partial progress toward the next occurrence of each slow/discrete reaction is not lost. Using the differential jump equations also simplifies the implementation of the Next Reaction variant of the SSA when reaction propensities may become zero, eliminating the complications associated with saving reaction times for future occurrences; see the original work of Gibson and Bruck for a discussion of the consequences of zero reaction propensities.
The Stochastic Simulation Algorithm using the differential Jump equations:
1. Select the minimum reaction time, $\tau_\mu$, from the indexed priority queue.
2. If $\tau_\mu > t_{stop}$, then no slow/discrete reactions occur. Numerically integrate the differential jump equations from t to $t_{stop}$ and set $t = t_{stop}$. Otherwise, continue.
3. Numerically integrate the differential jump equations from t to $\tau_\mu$ using $R_j = R_j + a_j^s(X)\,(\tau_\mu - t)$ for $j \neq \mu$. Reset the μ-th reaction residual to $\log(\mathrm{URN}_\mu)$.
4. Execute the μ-th slow/discrete reaction. Update the state: $X = X + \nu_\mu$. Update the time: $t = \tau_\mu$. Update the slow/discrete reaction rates, $a^s(X)$.
5. Update the slow/discrete reaction times, $\tau_j = -R_j / a_j(X) + t$, where $a_j(X) > 0$. Otherwise, $\tau_j = \infty$.
6. Resort the indexed priority queue. Go to step 11 of the HyJCMSS propagator subroutine.
Examples, Error Analysis, and Critical Comparisons
All of the following simulations were performed on a Sun UltraSPARC workstation with a 950 MHz processor and 6 GB of RAM.
The "Cycle Test" is designed to evaluate the numerical accuracy of the proposed algorithms and to highlight the advantages of a hybrid method for simulating systems with multiple time scales. The initial conditions of the species and the kinetic constants of the slow reactions are varied with the system size, Ω, to increase the separation between the fast and slow reaction rates. The fast reactions are numbered 1 to 3 and the slow reactions 4 and 5; the slow reactions have average rates of 0.75 and 1 molecule per second, respectively, at all system sizes, while the rates of the fast reactions increase with the system size.
The Cycle Test, at system sizes of 100, 1000, 10 000, and 100 000, is simulated to an end time of 100 seconds using the stochastic simulation algorithm (SSA), the HyJCMSS method, and the HyJCMSS+MSR method.
Table 2.3: Cycle Test reactions and parameters
$A_o = \Omega$ molecules, $B_o = 2\,\Omega$ molecules, $C_o = 3\,\Omega$ molecules, $D_o = E_o = 0$ molecules
Figure 2.13: Comparison of the (A) mean and (B) variance of the Cycle Test with a system size of
100, using the (lines) stochastic simulation algorithm and the (dots) HyJCMSS method without the Multiple Slow Reaction approximation.
The HyJCMSS method is run with $MSR_{tol} = 0$ and the HyJCMSS+MSR method with $MSR_{tol} = 1$. The stochastic numerical integrator uses a maximum time step of 0.1 seconds. Table 2.4 lists the average computational run times per trial, normalized by the SSA run time. A total of 10 000 independent trials are performed for system sizes of 100, 1000, and 10 000.
Figure 2.13 compares the mean and variance of the solution for a system size of 100, using the Next Reaction hybrid and SSA methods. Figure 2.14 shows the weak mean and variance errors of the representative species A and D for all three methods, with the weak mean and variance errors defined accordingly.
The probability distributions of the species in the Cycle Test with a system size of 100 at 5 and 40 seconds are shown in Figures 2.15 and 2.16, comparing the SSA and the Next Reaction hybrid method. This comparison is particularly informative when the solution deviates from a Gaussian distribution. Notably, species D and E are not Gaussian distributed; the proposed hybrid method nevertheless captures their heavy-tailed behavior, which is crucial for accurately simulating rare but significant events.
As the Cycle Test system size grows, the stochastic simulation algorithm requires more time to execute the rising number of reaction occurrences. In contrast, the proposed hybrid methods approximate the accelerating reactions as a continuous Markov process and maintain essentially constant running times across system sizes. Consequently, as the system size increases, the efficiency of the HyJCMSS method is unaffected.
Figure 2.14: The weak mean (A, C) and variance (B, D) errors of species A and D in the Cycle Test at system sizes of 100 (light/dotted), 1000 (dark/dashed), and 10 000 (dark/solid), using the HyJCMSS method (A, B) and the HyJCMSS method with the Multiple Slow Reaction approximation (C, D).
Figure 2.15: The probability distributions of (left) species A, B, and C and (right) species D and E in the Cycle Test with a system size of 100 at t = 5 seconds, using the (lines) stochastic simulation algorithm and the (dots) HyJCMSS method without the MSR approximation.
Figure 2.16 shows the corresponding probability distributions of species A, B, C, D, and E in the Cycle Test with a system size of 100 at t = 40 seconds, again comparing the SSA and the HyJCMSS method without the MSR approximation. Figures 2.13 and 2.14 confirm the accuracy of the HyJCMSS method, with and without the MSR approximation, in terms of the moments of the probability distribution, while Figures 2.15 and 2.16 demonstrate the accurate reconstruction of the full probability distributions, including non-Gaussian distributions such as those of species D and E at 5 seconds, where the distribution exhibits a heavy tail. These results are obtained with far less computational effort, a speedup of several orders of magnitude. The multiple slow reaction approximation, which allows reactions 4 and 5 to occur multiple times within a single time increment, provides a further efficiency improvement of approximately 21 times.
The Pulse Generator is a simplified model of the molecular mechanism behind the circadian rhythm of the fruit fly Drosophila. It comprises two genes, G1 and G2, which produce the monomer proteins P1 and P2, respectively. These monomers combine to form a dimer, P1:P2, which can bind to operator sites on genes G1 and G2 and repress their expression. Each gene contains three operator sites, each binding one dimer, and binding is cooperative, so that subsequent dimers bind more rapidly than the first. A gene is expressed only when no dimers are bound to its operator sites.
A kinase enzyme, E, can phosphorylate both the monomer and dimer proteins, targeting them for rapid degradation by the cell's proteolytic machinery. The enzyme has a much lower catalytic rate on the dimer than on the monomers, producing a classic Michaelis-Menten competitive inhibition scenario. Once phosphorylated, both dimer and monomer are immediately degraded. The kinetic parameters for monomer and dimer phosphorylation are $10^{4}$ and $10^{-4}$ [sec]$^{-1}$ for $k_{cat}$ and 20 mM and 10 µM for $K_m$, respectively. The reactions and kinetic coefficients are listed in Table 2.5. This molecular mechanism produces oscillatory expression of genes G1 and G2, with roughly constant amplitude and a fluctuating period caused by the probabilistic binding and unbinding of dimers at the genes' operator sites.
In biological systems like the Pulse Generator, enzymatic and dimerization reactions account for the vast majority of reaction events. The stochastic simulation algorithm (SSA) spends most of its time resolving these individual reaction occurrences. In contrast, the hybrid methods treat these reactions as 'fast' and model their dynamics with stochastic differential equations. In the Pulse Generator, the simultaneous production of the monomers directly drives dimerization and the rapid repression of both genes. Although the monomers exist only for brief periods, the SSA devotes most of its computational effort to tracking their reactions.
Table 2.4: Ratios of Computational Run Times of Cycle Tests
System Size | $T_{SSA}/T_{HyJCMSS}$ | $T_{SSA}/T_{HyJCMSS+MSR}$
Table 2.5: A Simplified Model of the Pulse Generating Gene Network in Drosophila Circadian Rhythm
2 | G1* → G1 + 20 P1 | 100 | 1st
5 | P1 + P2 → P1:P2 | 2e8 | 2nd
6 | P1:P2 → P1 + P2 | 2.0 | 1st
7 | G1 + P1:P2 → G1:P1:P2 | 1e4 | 2nd
10 | G2 + P1:P2 → G2:P1:P2 | 1e4 | 2nd
12 | G2:2P1:P2 + P1:P2 → G2:3P1:P2 | 3e4 | 2nd
16 | G2:3P1:P2 → G2:2P1:P2 + P1:P2 | 1e-3 | 1st
19 | P1 + E → P1:E | 5e5 | 2nd
20 | P1:E → P1 + E | 10.0 | 1st
21 | P1:E → E | 1e4 | 1st
24 | P2:E → E | 1e4 | 1st
26 | P1:P2:E → P1:P2 + E | 5.0 | 1st
Volume = $10^{-15}$ Liters; 1st = [sec]$^{-1}$; 2nd = [M sec]$^{-1}$
The simulation is initialized with #G1 = #G2 = 1 and #E = 100 molecules. The proposed hybrid methods partition the dimerization and enzymatic reactions into the fast reaction subsystem, enabling a time increment for numerical integration that is much larger than the corresponding reaction times. Figure 2.17 shows the oscillatory behavior of the monomer species alongside the computational running times of the HyJCMSS and SSA methods. The SSA running time rises sharply whenever frequent dimerization reactions occur, followed by the rapid degradation of monomers and dimers. The typical running times are 1898 seconds for the SSA, 160 seconds for HyJCMSS, and 147 seconds for HyJCMSS with the MSR approximation, an efficiency improvement of 11.9 to 12.9 times for the proposed hybrid methods.
The Pulse Generator reaction system illustrates the benefits of dynamically partitioning the reactions into fast and slow subsets. Although the fast dimerization reactions occur during only a small fraction of the total simulated time, the stochastic simulation algorithm (SSA) spends most of its running time resolving them. The proposed hybrid algorithms partition the reactions into fast and slow subsets, using the SSA when no fast reactions are present and the Next Reaction hybrid method otherwise. By rapidly simulating the periods dominated by fast dimerization reactions, the HyJCMSS method significantly reduces the overall running time compared with the traditional SSA.
Comparison with the Haseltine and Rawlings Method
The Direct hybrid method of Haseltine and Rawlings uses the chemical Langevin equation to approximate the fast reactions while allowing only one slow reaction per time step, exploiting the probability distribution of the 'Direct' variant of the stochastic simulation algorithm. Our numerical implementation uses a scaled stochastic time step with a 'no slow reaction' propensity of 10.0 seconds. We first compare this method on a previously analyzed simple model of crystallization, focusing on the accuracy and efficiency of each hybrid method, and then on a benchmark model of a large-scale system.
Figure 2.17: The oscillatory dynamics of the monomers P1 and P2 in the Pulse Generator reaction system, shown together with the computational running times of the HyJCMSS method and the SSA. The right Y-axis indicates the processor time in seconds needed to simulate the system up to the corresponding simulation time, highlighting the efficiency of the hybrid methods on a system that contains both fast and slow reactions.
An Equation-Free Probabilistic Steady State Approximation
Introduction
The stochastic simulation of systems of chemical or biochemical reactions has become an important tool in quantitatively describing the behavior of 'small' chemical and biochemical systems.
Simulations of jump Markov processes with discrete states offer an exact mesoscopic description of a variety of physical and chemical phenomena, including reaction and diffusion. While the original and improved stochastic simulation algorithms provide exact realizations for well-stirred systems of biochemical reactions, their computational cost scales with the number of reaction events. This poses challenges for simulating systems with both fast and slow reactions, typical of biological organisms in which rapid enzymatic reactions influence the regulation of infrequently expressed genes. Reducing the computational cost of these methods is important because it directly expands their use in sensitivity analysis, global optimization, and the generation of bifurcation diagrams for mesoscopic and microscopic systems.
To reduce the computational cost of stochastic simulations, several approximations have been introduced. Fast reactions can be modeled as discrete Poisson or binomial processes, allowing reaction occurrences to be executed in "bundles" drawn from these distributions. Fast reactions involving large numbers of molecules may also be approximated as a continuous Markov process described by a chemical Langevin equation. Recent work has produced hybrid methods that combine these approximations with the original jump Markov process, with the latest research quantifying the global error and proposing a new approximation to optimize simulations involving thousands of reactions.
Stiffness among the fast or continuous reactions remains a common challenge for stochastic simulators. It can be addressed by approximating those reactions as a continuous Markov process, leading to stiff stochastic differential equations (SDEs) that can then be numerically integrated with adaptive or implicit methods.
Methods for Poisson-driven stochastic differential equations have also begun to appear, but the underlying numerical theory is still in its early stages. To effectively address stiffness in these equations, a method that enhances or replaces the current approaches is still needed.
A common approach to stiffness in differential equations is the quasi steady state approximation (QSSA), which converts a subset of the equations into non-linear algebraic constraints by assuming that they rapidly converge to a steady state. This idea has been applied to the chemical Master equation, where selected intermediate species of small enzymatic reaction networks are eliminated, yielding time-independent distributions and the familiar Michaelis-Menten kinetics. Recent methods by Cao, Gillespie, and Petzold and by Goutsias extend this technique by partitioning the reaction system into 'fast' and 'slow' components, streamlining the analysis of complex biochemical processes.
Here we show how a probabilistic steady state approximation (PSSA) can be applied to jump Markov processes to significantly reduce the cost of simulation. While the exact solution for linear chemical systems can be obtained from the chemical Master equation, non-linear systems generally lack closed-form solutions. We therefore propose an equation-free approach that generalizes these ideas to any chemical or biochemical reaction system, including non-linear ones. The method dynamically applies the PSSA when it is valid, computes the steady state distribution from sampled trajectory data, and accurately simulates the subsequent slow reactions, advancing the simulation time without assuming a particular form for the steady state distribution, requiring only that it be ergodic.
The method is termed 'equation-free' because it does not solve traditional evolution equations, such as the chemical Master equation or differential equations, to compute the quasi steady-state probability distribution. Instead, it uses a kinetic Monte Carlo method to simulate the forward stochastic dynamics of the jump Markov process, detects a quasi steady state marginal distribution, and samples states directly from that distribution, eliminating the need for moment truncations or other approximations.
The method accelerates kinetic Monte Carlo algorithms and stochastic simulators that model system dynamics described as jump Markov processes, such as systems of chemical or biochemical reactions. It is most effective for systems whose rapid reactions relax to a stable probabilistic steady state. We begin with a simple example illustrating the dynamic application of a probabilistic steady state approximation, then generalize the principles to arbitrary coupled reactions. Four case studies are examined, comparing computational cost and accuracy against an optimized stochastic simulation algorithm, and we conclude with a discussion of the method's limitations.
Consider the following non-linear toy reaction network in a bacterial-sized volume of $10^{-15}$ Liters with an initial condition of #A = 45 molecules, #B = #C = #D = 25 molecules, and #E = 0 molecules:
A + B → C    $k_1$ = 1.3284 [molecules sec]$^{-1}$    (R1)
C → A + B    $k_2$ [sec]$^{-1}$    (R2)        (2.147)
C + D → E    $k_3$ = 3.32 × 10$^{-4}$ [molecules sec]$^{-1}$    (R3)
While this is just an example, this system has many commonalities with real biological systems. Species C and E may be multimeric proteins formed from the monomer proteins A, B, and D: the protein heterodimer C binds to the mRNA regulatory binding site D, creating the complex E. The system involves small numbers of molecules, so we model it as a jump Markov process. The reactions R1 and R2 are reversible and fast, whereas reaction R3 occurs comparatively rarely.
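For reference, trajectories like the one described for Fig. 2.19 can be reproduced qualitatively with a plain Direct-method SSA; the Python sketch below is illustrative only, and because the value of $k_2$ (and the exponent of $k_3$) is garbled in the listing above, the constants used here are assumptions rather than the thesis values.

```python
import numpy as np

# Toy network R1-R3; k2 and the exponent of k3 are assumed placeholder values.
X0 = np.array([45, 25, 25, 25, 0])              # A, B, C, D, E
nu = np.array([[-1, -1, +1,  0,  0],            # R1: A + B -> C
               [+1, +1, -1,  0,  0],            # R2: C -> A + B
               [ 0,  0, -1, -1, +1]])           # R3: C + D -> E
k1, k2, k3 = 1.3284, 1.0e2, 3.32e-4

def propensities(X):
    return np.array([k1 * X[0] * X[1], k2 * X[2], k3 * X[2] * X[3]])

def ssa_direct(X, t_end, rng=None):
    """Gillespie's Direct stochastic simulation algorithm."""
    rng = np.random.default_rng() if rng is None else rng
    X, t = X.copy(), 0.0
    while t < t_end:
        a = propensities(X)
        a0 = a.sum()
        if a0 == 0.0:
            break
        t += rng.exponential(1.0 / a0)          # waiting time to the next reaction
        j = rng.choice(len(a), p=a / a0)        # index of the reaction that fires
        X += nu[j]
    return t, X

t_final, X_final = ssa_direct(X0, t_end=10.0)
```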
The stochastic simulation algorithm spends most of its computational effort resolving the fast reactions rather than the slow one. When Poisson or binomial approximations are applied, the algorithm's inherent time step becomes the crucial factor in accurately capturing the dynamics.
Figure 2.19: A stochastic simulation trajectory of the dynamics of reactions R1 through R3 for 1000 seconds. Species A, B, and C are drawn with light lines, species D and E with dark lines, and the vertical dotted lines mark the time intervals discussed in the text.

The speed of the simulation is inversely related to the sum of the reaction propensities, so the simulation slows down whenever many fast reaction events must be executed. Even for this simple system, analyzing the long-time behavior requires significant computational time to resolve both the fast and slow reactions. Figure 2.19 shows a trajectory of the system generated using the stochastic simulation algorithm.
The species evolve on two primary time-scales: the rapid reactions, R1 and R2, and the infrequent reaction R3. After each occurrence of reaction R3, the fast time-scale quickly carries species A, B, and C to a probabilistic steady state. The numbers of molecules of species A, B, and C change over time, but the probability distribution of the number of molecules is roughly constant for a large period of time. Just after each occurrence of reaction R3, the species A, B, and C are no longer at a probabilistic steady state, but they may reach a new steady state within a short period of time. If the occurrences of reaction R3 are rare, then there may be a significant period of time when the species A, B, and C are at a probabilistic steady state, such as the time interval (t2, t4) labeled in Fig. 2.19. In this region, about 300 000 uninteresting reaction events are executed by the stochastic simulation algorithm, wasting computational time. There are many of these regions in this simulation and, by eliminating extraneous reaction occurrences, we may reduce the computational cost significantly. Note that we are loosely using the term probabilistic steady state: the distribution of some subset of species becomes insensitive to time, making it more of an approximate steady state. We will formalize the definition of this approximation in Section 2.6.3.

There are three important regions labeled in Figure 2.19: the relaxation period, the sampling period, and the leapfrog period. Reaction R3 fires at time t1, causing any previous probabilistic steady state approximation (PSSA) to be invalid. Between times t1 and t2, the marginal distribution of species A, B, and C is both stable and quickly relaxing to a probabilistic quasi steady state; the marginal distribution moves quickly from a time-dependent one to a stationary one. At time t2, we start to sample the steady state probability distribution. Once we have collected enough samples to accurately determine the time of the next firing of reaction R3, which is t4, we leap ahead to that time and ignore any occurrences of reactions R1 and R2 between times t3 and t4. By repeating this process, we can skip over many occurrences of reactions R1 and R2 while still accurately resolving the occurrences of reaction R3. The important questions are as follows: how do we identify when a probability distribution is stable and converges to a quasi steady state; how many samples from the distribution are required to accurately compute the time of the next slow reaction; and how do we choose the state of the system at any time in the leapfrog period and at the time just prior to the occurrence of the next slow reaction? We present answers to these questions in the next section.
Theory
In a system with M chemical or biochemical reactions and N unique species, the state is represented by an N x 1 vector, X(t), giving the number of molecules of each species at time t. The system behaves as a jump Markov process with initial condition X(0) = X_0. The reaction propensities, collected in a vector a, are positive functions that determine the probability of each reaction occurring within a small time interval dt. The stoichiometric matrix, V, defines the change in the species' molecule counts produced by each reaction. Each reaction has an associated reaction time, and the next reaction to occur is the one with the shortest time. The occurrence of the jth reaction transitions the system state from X(\tau_j) to X(\tau_j) + V_j, where \tau_j is its reaction time. To count reaction occurrences, we define R_j(t, t') as the number of occurrences of the jth reaction within the time interval (t, t').
To apply a valid probabilistic steady state approximation (PSSA), we first partition the reactions into fast and slow subsets and assess the separation of time scales between their dynamics. The partitioning is based on which reactions are likely to drive their participating species toward a quasi steady state marginal distribution. By counting the occurrences of each reaction, we dynamically evaluate the time scale separation and identify when the PSSA may be applied. When the approximation is valid, we assume the steady state distribution is ergodic, allowing us to sample the system's state over time and to compute its future states from those samples.
In the context of random dynamical systems, terms like "quasi steady state marginal distribution" and "probabilistic steady state approximation" are the stochastic counterparts of deterministic concepts, reflecting distinct notions of stability and timescale separation. We adopt Arnold and Schmalfuss's definition of a random attractor on a compact random set and analyze its stability from a "pull back" perspective, in which trajectories converge to the attractor in probability. In multi-dimensional systems, or in systems with many chemical species, the marginal distribution of specific species may rapidly approach a quasi stable attractor that is asymptotically stable in a lower-dimensional space while still evolving in the larger one. We refer to this as the "quasi steady state marginal distribution", sometimes omitting the "quasi" for brevity. The probabilistic steady state approximation uses this steady state marginal distribution to resolve the slow dynamics without fully simulating the fast dynamics.
This section outlines the criteria for the partitioning, the conditions for detecting the convergence and stability of a steady state marginal distribution, and the method for computing the system's state at future times, referring to the times and regions illustrated in Figure 2.19. Specifically, t1 is the time of the previous slow/discrete reaction, t2 is the time at which the probability distribution approaches its steady state, t3 is the time at which the fast/discrete reactions are turned off, and t4 is the time of the next slow/discrete reaction.
The method aims to simulate reaction networks efficiently while maintaining accuracy, focusing on reactions whose participating species quickly reach a quasi stable marginal distribution. While the entire system eventually converges to a steady state over infinite time, our primary interest lies in species that relax to a marginal distribution in a short, finite timeframe. Species that converge rapidly are typically those frequently affected by reaction events and initially close to their steady states. To accurately reconstruct these distributions with few samples, the sample space must also be sufficiently small. By concentrating on these distributions, we reduce both the time needed to detect stability and convergence and the number of samples required.

To determine the appropriate marginal distribution, we partition the system into fast/discrete and slow/discrete reaction subsets based on the reaction rates and the numbers of molecules involved. Faster reactions cause their participating species to converge more quickly to a steady state, while reactions involving discrete species with fewer molecules have a larger effect over shorter timeframes, also promoting faster convergence. We therefore focus on applying the PSSA to dilute species affected by rapid reactions.
Throughout the simulation, all reactions are dynamically reclassified based on their behavior within a specified time interval, (t1, t4). If a fast/discrete reaction fails to meet the criteria during this interval, it is removed from the fast/discrete subset. Specifically, the jth reaction belongs to the fast/discrete subset FD(t1, t4) if it satisfies the conditions of Eq. (2.148): its propensity must exceed the threshold \lambda [molecules/sec], a_j(t) > \lambda for all t in (t1, t4), and at least one of its reactant or product species must remain discretely valued, with fewer than \epsilon [molecules], for the duration of the interval. For reactions involving multiple species, it suffices for only one of them to be discretely valued, which constrains the extent of the reaction to within \epsilon / |\nu_{ij}|. Reactions that do not satisfy these conditions are classified as slow/discrete.
Following Cao et al. [21], the species are categorized into 'fast' (X^f) and 'slow' (X^s) subsets within the time interval (t1, t4), with corresponding state vectors X^f and X^s. A species is deemed 'fast' if it is affected by any fast/discrete reaction; otherwise, it is classified as 'slow'. That is, the ith species is fast if there exists a reaction j in FD(t1, t4) with \nu_{ij} \neq 0.
If a species is affected by both fast/discrete and slow/discrete reactions it is classified as a fast one.
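As a small illustration of these criteria, the snippet below evaluates the propensity threshold λ and the discreteness threshold ε at a single state. In the full method the propensity condition must hold over the whole interval (t1, t4) and is monitored dynamically, so this per-state check is only a simplified sketch; the function names and default thresholds are ours.

```python
import numpy as np

def classify_reactions(x, a, V, lam=10.0, eps=100):
    """Boolean mask of reactions passing the fast/discrete test at state x:
    propensity a_j >= lam, and at least one species changed by reaction j
    holds fewer than eps molecules."""
    fast = np.zeros(len(a), dtype=bool)
    for j in range(len(a)):
        touched = np.nonzero(V[j])[0]          # reactants/products of reaction j
        discrete = np.any(x[touched] < eps)
        fast[j] = (a[j] >= lam) and discrete
    return fast

def classify_species(V, fast_reactions):
    """A species is 'fast' if any fast/discrete reaction changes its count."""
    return np.any(V[fast_reactions] != 0, axis=0)
```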
Fast/continuous reactions can also occur when the participating species have large numbers of molecules. In that case, we categorize these reactions separately and use the Hybrid Jump/Continuous Markov Process Stochastic Simulation (HyJCMSS) method to resolve their dynamics. This section considers only systems with slow/discrete and fast/discrete reactions.
Stability and Convergence of an Unknown Distribution
The complete joint probability distribution, P(X^f, X^s), may take a long time to reach its stable state. Following Rao and Arkin [72], we decompose the joint probability distribution into its conditional and marginal components and examine their behavior over the relevant time interval (t1, t4).
The joint distribution is written as the product of the conditional probability distribution of the slow species and the marginal probability distribution of the fast species,

P(X^f, X^s, t \mid X(t_1), t_1) = P(X^s, t \mid X^f, t;\, X(t_1), t_1)\; P(X^f, t \mid X(t_1), t_1)

allowing the two distributions to be treated independently.
The conditional probability distribution of the slow species evolves gradually over time, because slow/discrete reactions are rare. Given their infrequency, if no slow reaction occurs within the time interval, the trajectories of the slow species remain unchanged from their values at the initial time, and the conditional probability distribution of the slow species, conditioned on its state at the initial time, is effectively stable.
In that case the conditional probability distribution is simply a delta function,

P(X^s, t \mid X^f, t;\, X(t_1), t_1) = \delta\!\left(X^s - X^s(t_1)\right) \quad \forall\, t \in (t_1, t_4)   (2.151)

so that, when Eq. (2.151) holds, the conditional probability distribution remains essentially constant over the interval. We cannot say how this distribution approaches its steady state; we can only assert that, within the time interval (t1, t4), it is exactly the delta function.
We now examine the stability of the marginal distribution of the fast species and its convergence to a steady state distribution at some time t2. A stable marginal distribution is either a time-independent stationary distribution, as in Eq. (2.152),
P(X^f, t \mid X(t_1), t_1) = P^S\!\left(X^f \mid X(t_1)\right) \quad \forall\, t \in (t_2, t_4)   (2.152)

or a time-invariant distribution, as in Eq. (2.153),
P(X^f, t \mid X(t_1), t_1) = P^S\!\left(X^f, t \mid X(t_1)\right) \quad \forall\, t \in (t_2, t_4)   (2.153)

where the distribution oscillates with some constant frequency, f. Time-invariant distributions may arise from oscillatory chemical or biochemical reaction systems.
Numerical Implementation
The numerical method enhances kinetic Monte Carlo simulations of chemical and biochemical reaction dynamics, building on the Next Reaction variant of the stochastic simulation algorithm. It applies the probabilistic steady state approximation (PSSA) when specific sufficient conditions are met and reverts to the standard stochastic simulation algorithm when they are not. We provide a pseudocode sketch to show how easily the method can be integrated into existing stochastic simulation frameworks.
The state variable X and its initial condition X_0 are (N x 1) integer variables, while the current time t, the initial time t_0, and the next reaction time are real scalar variables. The reaction propensities a and the reaction times \tau are (M x 1) real arrays, and the stoichiometric matrix V is an (M x N) integer array. The identity of the next occurring reaction, \mu, is an integer scalar, accompanied by two logical scalars, PSS and FastRxnsOff. The reaction occurrence counter RxnCounter is an (M x 1) integer array, and the number of saved samples is tracked by the integer scalar SaveCounter. The implementation is written in Fortran95 on a Linux system; it uses a heap sort to identify the minimum reaction time and a dependency graph so that reaction propensities and times are recalculated only when necessary, and the sampling period is set to one for simplicity.

Initialization:

1. The state and time are set to their initial conditions, X = X_0 and t = t_0.
2. The reaction propensities, a, are calculated and used to compute the reaction times, \tau, which are sorted minimum-on-top in the heap.

3. PSS and FastRxnsOff are set to .FALSE., and SaveCounter is set to the first position.
Time iterative loop (stop at desired time):
1. Classify each reaction as fast/discrete or slow/discrete according to Eq. (2.148).

2. Determine the identity of the next reaction, \mu, from the heap sort.

3. Determine the time of the next reaction, \tau_\mu, from the top position of the heap sort.

4. Execute the next reaction occurrence in the following FastRxnsOff-dependent way:
5. If FastRxnsOff is .TRUE., then do the following:
• X is first sampled from the distribution in Eq. (2.172). Then the \mu th reaction is executed with X = X + V_\mu.
• The time is updated, t = \tau_\mu.
• The fast/discrete reaction propensities and times are recomputed. The affected slow/discrete reaction propensities and times are recomputed. The heap is resorted.
6. If FastRxnsOff is .FALSE., then do the following:
• The state is updated, X = X + V_\mu.
• The time is updated, t = \tau_\mu.
• Both the fast/discrete and slow/discrete reaction propensities and times, where affected, are updated and the heap is resorted.
7. To detect the convergence of the marginal distribution to a quasi steady state, count reaction occurrences: if the \mu th reaction is fast/discrete, increment RxnCounter(\mu); if it is slow/discrete, reset all values of RxnCounter to zero. If the RxnCounter values of all fast/discrete reactions exceed the threshold \theta, set PSS to .TRUE.. If PSS is already .TRUE. and a slow/discrete reaction occurs, set PSS to .FALSE. and reset SaveCounter to the first position.
8. If PSS is .TRUE., then sample the underlying distribution by doing the following:
• Save the current state of the system, X_SaveCounter = X.
• Increment SaveCounter.
9. If the probabilistic steady state has been sufficiently sampled, then leap forward by turning off the fast/discrete reactions. If SaveCounter is greater than \omega, then do the following:
• Set FastRxnsOff to .TRUE..
• For all fast/discrete reactions, set \tau_j to infinity and resort the heap.
• For all slow/discrete reactions, compute \tau_j using Eq. (2.168) and resort the heap.
10. The current time and state of the system may be saved to disk as often as desired. If FastRxnsOff is .TRUE., the state of the system at any time before the occurrence of the next slow/discrete reaction may be sampled from the distribution in Eq. (2.163).
The proposed numerical method builds on stochastic simulation techniques that model discrete reaction occurrences and their effects on the system, augmented with a probabilistic steady state approximation that is simple to implement. The key steps in the time iterative loop match the original stochastic simulation algorithm, while the additional steps manage the dynamic application of the PSSA. Reactions are reclassified as fast/discrete or slow/discrete after each occurrence, and the RxnCounter variable provides a quick test for convergence to a stable steady state. Once convergence is detected, samples are collected; if enough samples are gathered before the next slow reaction occurs, the reaction times are modified. Fast reaction times are set to infinity so that they are never selected, while slow reaction times are recomputed from the reaction propensities so that the next slow occurrence is selected naturally from the heap. The simulation then leaps forward in time, using the PSSA to supply the system's state, and the system may be sampled as often as desired during long leaps.
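A compact sketch of this control flow is given below, assuming the state and stoichiometry conventions of the earlier SSA example. For brevity it substitutes direct-method reaction selection for the Next Reaction heap, a simple per-reaction counter for the full RxnCounter bookkeeping, and an exponential leap based on the slow propensities averaged over the ω saved samples in place of Eqs. (2.168) and (2.172); it illustrates the flow of the algorithm, not the Fortran95 implementation.

```python
import numpy as np

def pssa_ssa(x0, V, propensities, is_fast, t_end,
             theta=10, omega=10, rng=np.random.default_rng(1)):
    """Simplified PSSA-enabled SSA loop.

    V            : (M, N) stoichiometric matrix
    propensities : function state -> (M,) array
    is_fast      : function (state, a) -> boolean mask of fast/discrete reactions
    theta, omega : convergence-counter and sample-count thresholds
    """
    t, x = 0.0, np.array(x0, dtype=np.int64)
    rxn_counter = None
    samples = []
    while t < t_end:
        a = propensities(x)
        a0 = a.sum()
        if a0 == 0.0:
            break
        fast = is_fast(x, a)
        if rxn_counter is None:
            rxn_counter = np.zeros(len(a), dtype=np.int64)

        pss = bool(fast.any()) and np.all(rxn_counter[fast] >= theta)
        if pss and len(samples) >= omega:
            # Leapfrog: draw the next slow-reaction time from the slow propensities
            # averaged over the saved steady-state samples (a simplification).
            a_slow = np.mean([propensities(s)[~fast] for s in samples], axis=0)
            a_slow_tot = a_slow.sum()
            if a_slow_tot == 0.0:
                break
            t += rng.exponential(1.0 / a_slow_tot)
            x = samples[rng.integers(len(samples))].copy()  # sampled fast-species state
            j_slow = rng.choice(np.flatnonzero(~fast), p=a_slow / a_slow_tot)
            x += V[j_slow]
            rxn_counter[:] = 0
            samples.clear()
            continue

        # Ordinary SSA step.
        t += rng.exponential(1.0 / a0)
        j = rng.choice(len(a), p=a / a0)
        x += V[j]
        if fast[j]:
            rxn_counter[j] += 1
            if pss:
                samples.append(x.copy())   # sample the quasi steady state
        else:
            rxn_counter[:] = 0             # slow reaction invalidates the PSSA
            samples.clear()
    return t, x
```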
Accuracy and Speed
This section presents four examples that highlight the benefits and limitations of the probabilistic steady state approximation, focusing on how the separation of timescales in a simple system affects the method's accuracy and efficiency. By varying the approximation parameters (\lambda, \theta, \omega), we analyze their influence on solution accuracy and method speed up, while keeping the parameter \epsilon constant so that the examples involve only discrete reactions. All computational times reported are derived from single realizations on an Itanium2 1.5 GHz processor.
Definitions of Error, Speed, and Network Characteristics
To evaluate the accuracy and efficiency of the equation-free probabilistic steady state approximation (PSSA), we need quantitative measures that compare the approximate solution to the exact one. For stochastic processes, we use the weak mean and variance errors together with the L2 distance between probability distributions. The weak mean and variance errors capture inaccuracies in the first two moments of the solution, while the L2 distance measures the error in the overall distribution. The performance of the method also depends on characteristics of the reaction network, specifically the stiffness in event execution and the gap in timescales. By relating the performance of the proposed method to these network characteristics, we aim to validate its effectiveness and to predict its performance on other reaction networks.
An accurate solution for each reaction network is obtained by executing many independent realizations of the original stochastic simulation algorithm and computing the mean, variance, and probability distribution of all chemical species over time. We then simulate the reaction networks using the PSSA under identical initial conditions and evaluate the error metrics. The weak mean error for the ith chemical species is

\epsilon_{\mu,i}(t) = \left| \mathrm{E}\!\left[X_i^{SSA}(t)\right] - \mathrm{E}\!\left[X_i^{PSSA}(t)\right] \right|   (2.174)

and, for simplicity, the maximum weak mean error is defined as \epsilon_{\mu}(t) = \max_i \epsilon_{\mu,i}(t).
The weak variance error is similarly computed using

\epsilon_{\sigma,i}(t) = \left| \mathrm{var}\!\left(X_i^{SSA}(t)\right) - \mathrm{var}\!\left(X_i^{PSSA}(t)\right) \right|   (2.175)

and the maximum weak variance error is defined as \epsilon_{\sigma}(t) = \max_i \epsilon_{\sigma,i}(t).
The L2 distance between probability distributions is defined as

\|\Delta P_i\|_2(t) = \sum_{x \in \Omega_i} \left( P^{SSA}(X_i = x;\, t) - P^{PSSA}(X_i = x;\, t) \right)^2   (2.177)

where the sum is taken over all sampled values of the number of molecules of the ith species, \Omega_i.
Both probability distributions are conditioned on the same initial conditions and time. Of the three measures of error, the L2 distance is the most stringent.
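For reference, the following snippet computes the three error measures from two ensembles of sampled molecule counts (exact SSA samples versus PSSA samples) for one species at one time point; the function names are ours and the L2 distance is returned without further normalization.

```python
import numpy as np

def weak_errors(x_ssa, x_pssa):
    """Weak mean and variance errors (Eqs. 2.174-2.175) from two sample sets."""
    e_mean = abs(np.mean(x_ssa) - np.mean(x_pssa))
    e_var = abs(np.var(x_ssa) - np.var(x_pssa))
    return e_mean, e_var

def l2_distance(x_ssa, x_pssa):
    """Squared L2 distance between the empirical distributions (Eq. 2.177),
    summed over all sampled molecule numbers."""
    support = np.union1d(x_ssa, x_pssa)
    p_ssa = np.array([np.mean(np.asarray(x_ssa) == v) for v in support])
    p_pssa = np.array([np.mean(np.asarray(x_pssa) == v) for v in support])
    return np.sum((p_ssa - p_pssa) ** 2)
```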
Stiffness in event execution quantifies the difficulty of accurately simulating the reactions of a jump Markov process. It is measured by the ratio of the maximum to the minimum reaction propensity at a given time, S(t) = \max_j a_j(t) / \min_j a_j(t), which gives the propensity of the fastest reaction relative to the slowest, with propensities measured in reactions per second. We also define the size of the gap in timescales as the largest separation between the fast and slow reactions at a given time,
G(t) = \frac{\min_j a_j^{f}(t)}{\max_j a_j^{s}(t)}   (2.179)

where a^f and a^s denote the fast/discrete and slow/discrete reaction propensities, respectively.
Our results show a strong correlation between the effectiveness of the probabilistic steady state approximation and the average timescale gap, E[G(t)] or < G(t) >. We also define the speed up achieved by the proposed method as
\text{Speed Up} = \frac{\langle T^{SSA} \rangle}{\langle T^{PSSA} \rangle}   (2.180)

where < T > is the average computational time of a single trial, computed over many trials. Using these quantities, we systematically evaluate the accuracy and efficiency of the proposed method on reaction networks with distinct, quantitatively characterized properties.
The example of Section 2.6.2 has an average stiffness < S(t) > of 9.54 x 10^4 and an average timescale gap < G(t) > of 6.52 x 10^3. Using both the stochastic simulation algorithm and the probabilistic steady state approximation with the parameter set (\lambda, \theta, \omega) = (10, 10, 10), we run 10 000 independent trials of the reaction network. At this \lambda, reactions R1 and R2 are always classified as fast/discrete and reaction R3 is always slow/discrete.
Figure 2.20: The time evolution of the mean (top) and variance (bottom) of species A, E, and C in the illustrative reaction network, computed with the stochastic simulation algorithm (solid/blue) and the probabilistic steady state approximation (circles/red) with the parameter set (\lambda, \theta, \omega) = (10, 10, 10).
The probabilistic steady state approximation replaces over 250 000 reaction occurrences per trial, executing only 897 reaction events with a computational time of 4.78 x 10^-3 seconds per trial. The time evolution of the mean and variance of each chemical species, shown in Figure 2.20, indicates that the mean of the approximate solution is highly accurate, although there is a small systematic error in the variance. Figure 2.21 shows the probability distributions of species A and E at several times, with the L2 distance used to quantify the differences between the distributions. The L2 distance is a sensitive measure of distributional accuracy, capturing subtle differences that may not be visually apparent.
By increasing the parameters \theta and \omega, we aim to reduce the weak mean and variance errors as well as the L2 distance. We run 10 000 independent trials for each parameter set using the probabilistic steady state approximation and compute the error metrics. Figure 2.22 shows the time evolution of the maximum weak mean and variance errors as either \theta, \omega, or both are increased. The two parameters affect the error very differently for this simple reaction network.
Increasing the minimum number of samples, \omega, by a factor of 250 reduces both the weak mean and variance errors roughly seven-fold. In contrast, a 250-fold increase in \theta reduces the weak mean error by only about 25% and has little effect on the weak variance error. This suggests that the marginal distributions of species A, B, and C converge quickly enough that the smallest tested value of \theta is already adequate. We also vary \theta and \omega logarithmically from 10 to 2500, for a total of 625 parameter sets, while holding \lambda at 10. The average L2 distance between the probability distributions, shown in Figure 2.23, is averaged over the species and time points. With increasing \omega, the L2 distance initially decreases like \omega^{-0.35} and then remains roughly constant for \omega above about 200.
Figure 2.21: The probability distributions of species A and E of the illustrative reaction network at several time points, comparing the stochastic simulation algorithm (solid/blue) and the probabilistic steady state approximation (dashed/red) with (\lambda, \theta, \omega) = (10, 10, 10). The inset shows the time evolution of the L2 distance between the probability distributions of species A, C, and E.

With increasing \theta, the L2 distance decreases only slightly, paralleling the behavior of the weak mean and variance errors.
As the parameters \theta and \omega increase, the PSSA is applied less frequently, the simulation time advances in smaller leaps, and the computational time grows. Table 2.10 shows the effect of \theta and \omega on the speed up, which decreases roughly linearly with either parameter. This linear decrease in speed up, coupled with the much steeper reduction in solution error, implies that there is an optimal parameter set that balances accuracy against maximal speed up. Even so, for this reaction network the fastest parameter set still yields a reasonably accurate solution.
Consider the following reaction network used to study the implicit and explicit tau-leap methods [56].
Figure 2.22: The time evolution of the maximum weak mean error and maximum weak variance error in the illustrative example reaction network, computed using the probabilistic steady state approximation with \lambda = 10 while varying the parameters \theta and \omega.
Figure 2.23: The average L2 distance between exact and PSSA-enabled probability distributions of the illustrative example reaction network as a function of the parameters \theta and \omega with \lambda = 10.
Table 2.10: The effect of \theta and \omega on the probabilistic steady state approximation's speed up when simulating the illustrative example reaction network. Parameter \lambda is constant at 10.
Table 2.11: Accuracy and speed up of the probabilistic steady state approximation for the second example reaction network. Parameter \lambda is constant at 30 000.
Method | (\theta, \omega) | Speed Up | Executed Reactions | L2 Distance (S1, S2, S3)
PSSA | (100, 100) | 9.14 | 22 575 | 0.0135, 0.0136, 0.0142

The initial conditions are S1_0 = 400, S2_0 = 798, and S3_0 = 0 molecules. This reaction network has a < S(t) > of 2.89 x 10^4 and a < G(t) > of 2.24 x 10^3. Rathinam et al. report a speed up of
7.8 and 380 for the explicit and implicit tau-leap methods, respectively [56]. Here, we perform an analysis similar to the first example, using both the original stochastic simulation algorithm and the PSSA with several parameter sets to simulate the system on the time interval [0, 0.2] seconds. Varying \theta and \omega, we compute the probability distribution of all species at t = 0.2 seconds, the L2 distance between the exact and approximated distributions at this time, and the maximum weak mean and variance errors over time. For this example, we set \lambda to 30 000, keeping it within the wide gap in the timescales. Using the first parameter set of (\theta, \omega) = (10, 10), the probability distributions of all three species are captured very accurately (see Figures 2.24 - 2.26) with a speed up of 76.37. Increasing \theta or \omega or both by a factor of 10 reduces the L2 distance, but also the speed up (Table 2.11).
Conclusion
The equation-free probabilistic steady state approximation is an efficient method for simulating reaction networks that contain both fast and slow reactions. It can be integrated into existing kinetic Monte Carlo simulators, handles highly non-linear dynamics without assuming a form for the steady state distribution, and reproduces not only the mean and variance but the full probability distribution of the system. The method can also be combined with hybrid stochastic approaches to handle reaction networks containing many types of reactions, regardless of their stiffness.
A Fortran95 numerical implementation of the equation-free probabilistic steady state approxi- mation is available on request.
Hy3S: Hybrid Stochastic Simulation for Supercomputers
Introduction
Stochastic simulation, particularly kinetic Monte Carlo, effectively predicts the intracellular dynamics of biological organisms, capturing phenomena overlooked by deterministic methods, such as noise-induced oscillations and population heterogeneity. These probabilistic effects significantly influence both the quantitative and qualitative behavior of biological systems, underscoring the importance of stochastic approaches for understanding natural and synthetic systems. The original stochastic simulation algorithm and its improved variants provide exact realizations of jump Markov processes, accurately modeling the fluctuations in molecule numbers within a single cell. However, because these methods execute reaction or diffusion events one at a time, their computational cost escalates with the number of events, making it challenging to simulate systems with both fast and slow reactions, such as signal transduction networks coupled to gene expression. Approximate and hybrid stochastic methods have therefore been developed to make the simulation of large, realistic biological systems with diverse timescales practical.
Hy3S (pronounced hi-three-ess) is an open-source software package designed for hybrid stochastic simulation methods, specifically tailored for supercomputers Its primary aim is to enable users to leverage cutting-edge stochastic techniques to simulate vast and realistic biological systems The package features a variety of hybrid stochastic simulation methods, providing an accessible platform for researchers and developers in the field.
The MATLAB-driven GUI utilizes the NetCDF interface to efficiently store model and solution data in a platform-independent binary format By integrating MATLAB's scripting capabilities and data analysis functions with NetCDF's rapid data handling and our advanced MPI-parallelized hybrid stochastic simulation method, simulating large biochemical reaction networks with varying time scales becomes straightforward While designed for an MPI-enabled Intel Itanium2 cluster running Linux, the simulation programs are also compatible with x86, IBM, Cray, and SGI Altix platforms This tool is primarily aimed at scientists, engineers, and mathematicians experienced in computational modeling.
Recent advancements in approximate stochastic methods aim to reduce the computational costs of stochastic simulations, with a focus on hybrid stochastic approaches These methods effectively partition chemical reaction systems into subsets, utilizing distinct mathematical representations for each subset's time evolution and merging them to achieve accurate solutions while minimizing costs However, the challenge lies in addressing the coupling effects among these subsets, necessitating the simultaneous resolution of various mathematical processes Although several hybrid stochastic methods have emerged, the goal of developing a fast and accurate alternative to the original stochastic simulation algorithm for large, complex reaction networks—especially those with dynamic stiffness or varying timescales—remains unfulfilled Additionally, numerous software packages have been created to simulate biochemical network dynamics, employing both deterministic and stochastic techniques.
In our recent work on developing hybrid stochastic methods, we have progressed much closer to this goal by efficiently and accurately simulating a coupled jump/continuous Markov process
The method utilizes differential Jump equations, a form of stochastic differential equation (SDE), to determine the timing of slow reactions among numerous unique chemical species By linking jump and continuous Markov processes, this approach simultaneously solves a system of SDEs to analyze both fast and slow dynamics The established relationship between hybrid jump/continuous Markov processes and the robust theory of SDEs provides a solid foundation for the numerical method, facilitating the use of advanced numerical integration techniques that are implicit, higher order, and adaptive This connection also enables a precise characterization of both local and global errors in the solution, moving beyond vague approximations.
Software Implementation
An Overview of the Software Design
The software package primarily consists of simulation programs developed in Fortran95/2k and parallelized with MPI, which utilize NetCDF input files to simulate the stochastic dynamics of biochemical models The NetCDF format is open and self-describing, facilitating easy input creation and output analysis across various programming languages To streamline the development of biochemical networks, a user-friendly Matlab-driven GUI is provided for generating NetCDF files, alongside the option to use Matlab’s scripting capabilities for more complex network construction Users can also leverage Matlab's robust functions for data analysis and visualization of the simulation results While Matlab is used for network creation and data analysis, both the simulation programs and the NetCDF format remain fully open, allowing for ongoing research and development of efficient hybrid stochastic methods and the integration of diverse tools for biochemical modeling.
The simulation program collection features four numerical implementations of the hybrid jump/continuous Markov stochastic simulator (HyJCMSS) and the Next Reaction variant of the original stochastic simulation algorithm Each program is optimized for parallel processing with MPI, resulting in a total of ten distinct simulation options Comprehensive testing has shown that HyJCMSS stands out as the most efficient and accurate hybrid stochastic numerical method, particularly for simulating large reaction networks.
This section provides an overview of the numerical implementations of the algorithms, noting that each program can be used as a 'black box'; researchers who are not interested in the details of each numerical method can still use them productively.
An Overview of the Numerical Methods
The hybrid jump/continuous Markov stochastic simulator partitions the reaction system into 'fast/continuous' and 'slow/discrete' subsets. It describes the fast reactions with the chemical Langevin equation and computes the times of the slow reactions from the zero crossings of the differential Jump equations. Both the chemical Langevin equation and the differential Jump equations are Itô-type stochastic differential equations (SDEs) and are integrated with stochastic numerical methods. This approach greatly increases simulation efficiency compared to a pure jump Markov simulation, and the accuracy of the continuity approximation is governed by two parameters, \epsilon and \lambda, which fully parameterize the approximation for any reaction network.
The Multiple Slow Reaction (MSR) approximation significantly enhances the efficiency of simulating large biochemical networks by accommodating multiple occurrences of slow reactions between numerical integrations of the chemical Langevin equation However, this method may compromise accuracy in systems with highly mixed timescales To address this, we introduce an MSR tolerance that ranges from zero, which disables the approximation, to one, which employs it without restraint For optimal efficiency and accuracy in simulating systems with mixed timescales, a default tolerance value of 1/e strikes the best balance.
The numerical theory for integrating stochastic differential equations (SDEs) is markedly different from that of deterministic differential equations, particularly when dealing with higher-order, implicit, and adaptive integration methods In this context, SDEs are often non-linear, multiplicative, non-commutative, and involve multiple Wiener processes, also known as Brownian paths We explore the solutions to these SDEs using four strong numerical methods: the fixed time step Euler-Maruyama method, the fixed time step Milstein method, the adaptive time step Euler-Maruyama method, and the adaptive time step Milstein method Notably, we include the adaptive Euler-Maruyama method for educational purposes, despite its potential to converge to incorrect solutions, highlighting the critical distinctions between stochastic and deterministic numerical methods.
The Euler-Maruyama method retains only the drift and diffusion terms of the Itô-Taylor expansion and has a strong order of accuracy of O(\sqrt{\Delta t}). The Milstein method adds a third term involving multiple two-dimensional Wiener integrals, raising the strong order of accuracy to O(\Delta t). However, evaluating these integrals is computationally intensive and may decrease the overall efficiency of the simulation.
Adaptive time step methods enhance the efficiency of simulations for stochastic differential equations (SDEs), particularly in dynamically stiff systems These methods adjust the time step based on the system's stiffness, reducing it during periods of dynamical stiffness and increasing it when the system stabilizes Unlike deterministic methods, adaptive approaches require conditioning the paths of the Brownian process on previously realized points to maintain accuracy, avoiding bias in the solution Optimal time step selection relies on assessing the local error in both drift and diffusion terms, alongside ensuring that fast reactions are accurately approximated as a continuous Markov process This evaluation guides the necessary adjustments to the time step For a comprehensive understanding of adaptive time step methods for SDEs, additional resources are available, and a comparative overview of numerical methods is provided in Table 2.15, with a brief discussion on the Euler-Maruyama and Milstein methods.
Simulation programs offer various command-line parameters that allow users to customize the accuracy and efficiency of simulations for specific systems Users can also opt for default parameter values to achieve a reasonably accurate and efficient solution For a detailed overview of these parameters, please refer to Table 2.16.
In simulation programs, optimized data structures enhance computational performance through three types of numerical operations When addressing systems of stochastic differential equations (SDEs), we leverage the sparseness of the stoichiometric matrix and the two-dimensional stochastic integrals By developing indexes and inverse indexes, we effectively map the complete system to a more efficient reduced format.
Table 2.15: An overview of the Hy3S numerical methods

Method | Advantages | Disadvantages
Next Reaction variant of SSA | Essentially exact | Extremely slow for 'large' systems
HyJCMSS Fixed Euler-Maruyama | Significantly faster for large, stiff systems; the SDE numerical integrator works well for non-stiff systems | For stiff systems, species populations may become negative; finding an accurate fixed time step can be difficult
HyJCMSS Fixed Milstein | Increased accuracy; may use a larger time step | Evaluation of the 2D Itô integrals decreases the speed of the simulation
HyJCMSS Adaptive Euler-Maruyama | Automatically selects an accurate time step based on the SDE tolerance | Does not always converge to the correct solution, making its use impractical; included for educational purposes only
HyJCMSS Adaptive Milstein | Dynamically chooses an accurate time step; increased efficiency when transient stiffness exists; with a reasonable tolerance, convergence to the correct solution is guaranteed | Slower than the fixed methods for systems with constant timescales, due to the computational overhead of the adaptive code
Table 2.16: A description of each command line argument and their default values

Command line parameter | Description | Which methods use it | Default value
Filename | NetCDF file name | All | None
Random seed | Seed value for the random number generator | All | System-dependent
Epsilon (\epsilon) | Minimum number of reactant and product molecules required to approximate a reaction as a continuous Markov process | All HyJCMSS methods | 100 [molecules]
Lambda (\lambda) | Minimum reaction rate required to approximate a reaction as a continuous Markov process | All HyJCMSS methods | 10 [molecules/sec]
MSR Tolerance | Maximum relative effect of slow reactions per numerical integration of the SDEs | All HyJCMSS simulation methods | 1/e

In addition, the SDE tolerance sets the maximum value of the drift and diffusion error criteria, and the maximum time step of the numerical integration defaults to 0.10 seconds. To enhance computational efficiency in solving the stochastic differential equations (SDEs), we employ a dependency graph so that reaction propensities and their derivatives are recomputed only when they change, analogous to the Next Reaction variant of the stochastic simulation algorithm but adapted to fast/continuous reactions and special events. An indexed priority queue identifies the reaction time of the next zero crossing of the differential Jump equations, and a sorted queue determines the timing of upcoming special events. These optimized data structures significantly improve simulation efficiency, particularly for systems with many reactions and chemical species.
Each simulation program is parallelized with MPI, allowing many independent stochastic dynamics simulations of a biochemical network to be executed simultaneously. Because independent trials are allocated to each processor, the parallel efficiency approaches 100% when the number of trials is divisible by the number of processors. The implementation ensures that no two processors write to the same section of the NetCDF file at the same time, while all processors may read from it concurrently. For example, running 10 000 independent simulations on 500 processors reduces the computational time by a factor of roughly 500. Computing clusters with thousands of processors, such as the NSF-supported TeraGrid [112], enable this high level of research productivity.
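The same embarrassingly parallel pattern can be sketched with mpi4py, assigning independent trials to each rank and gathering the results on rank 0. This is only an illustration of the idea, not the Fortran95/MPI code in Hy3S; `run_one_trial` is a hypothetical placeholder for any of the simulators discussed above.

```python
from mpi4py import MPI
import numpy as np

def run_one_trial(seed):
    # Placeholder for a single stochastic trajectory; returns e.g. a final state.
    rng = np.random.default_rng(seed)
    return rng.integers(0, 100)

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_trials = 10_000
my_trials = range(rank, n_trials, size)          # round-robin assignment of trials
my_results = [run_one_trial(seed=trial) for trial in my_trials]

all_results = comm.gather(my_results, root=0)    # collect results on rank 0
if rank == 0:
    flat = [r for chunk in all_results for r in chunk]
    print(len(flat), "trials completed")
```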
Solution of a Hybrid Jump/Continuous Markov Process
The Fixed Euler-Maruyama Method
The Euler-Maruyama method is an explicit stochastic numerical integrator with a strong order of accuracy of O(\sqrt{\Delta t}). It is derived from the Itô-Taylor expansion of the solution, truncated after the diffusion term. Applied to the chemical Langevin equation, the (k+1)th iteration of the scheme for the ith chemical species is
X_i^{k+1} = X_i^{k} + \sum_{j} \nu_{ji}\, a_j(X^k)\, \Delta t + \sum_{j} \nu_{ji} \sqrt{a_j(X^k)}\, \Delta W_j^k   (2.185)

where the sums run over the fast/continuous reactions and \Delta W_j^k is a Gaussian random number with mean zero and variance \Delta t.
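As an illustration, a minimal fixed-step Euler-Maruyama integrator for a chemical Langevin equation written in the form of Eq. (2.185) could look as follows; the propensity function and stoichiometric matrix conventions match the earlier sketches, and the clipping of negative propensities is our own safeguard rather than part of the scheme.

```python
import numpy as np

def euler_maruyama_cle(x0, V, propensities, dt, n_steps, rng=np.random.default_rng(2)):
    """Fixed-step Euler-Maruyama integration of the chemical Langevin equation.
    V: (M, N) stoichiometric matrix, propensities: state -> (M,) array."""
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(n_steps):
        a = np.clip(propensities(x), 0.0, None)          # guard against negative propensities
        dW = rng.normal(0.0, np.sqrt(dt), size=len(a))   # Wiener increments, variance dt
        x = x + V.T @ (a * dt) + V.T @ (np.sqrt(a) * dW)
        traj.append(x.copy())
    return np.array(traj)
```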
Applied to the differential Jump equations, the scheme is simply
The Fixed Milstein Method
The Milstein method is an explicit stochastic numerical integrator with a strong order of accuracy of O(\Delta t). The increased accuracy comes from retaining the next-order terms of the Itô-Taylor expansion around the solution. Applied to the chemical Langevin equation, the Milstein scheme is
X_i^{k+1} = X_i^{k} + \sum_{j} \nu_{ji}\, a_j(X^k)\, \Delta t + \sum_{j} \nu_{ji} \sqrt{a_j(X^k)}\, \Delta W_j^k + \sum_{j_1} \sum_{j_2} \nu_{j_2 i} \sum_{l=1}^{N} \nu_{j_1 l}\, \sqrt{a_{j_1}(X^k)}\, \frac{\partial \sqrt{a_{j_2}(X^k)}}{\partial X_l}\; I_{(j_1, j_2)}   (2.187)

where the sums over j, j_1, and j_2 run over the fast/continuous reactions and I_{(j_1, j_2)} denotes the double Itô integral over the time step.
The third term contains contributions from all pairs of fast/continuous reactions, involving multiple double stochastic integrals that must be numerically approximated for distinct indices; the evaluation of these stochastic integrals is described in section 2.4.4. Because the stoichiometric matrix is typically sparse, many of the summands in the third term are zero. After classifying the reactions as fast/continuous or slow, we build an index of the non-zero entries so that only the necessary two-dimensional stochastic integrals and coefficients are computed. The Milstein scheme applied to the differential Jump equations is the same as Eq. (2.186), since those equations contain no Wiener process.
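For a single Wiener process the double Itô integral reduces to I_{(1,1)} = ((ΔW)^2 − Δt)/2, so the extra Milstein term is easy to write down. The sketch below applies both the Euler-Maruyama and Milstein schemes to geometric Brownian motion, dX = μX dt + σX dW, a standard scalar test problem rather than one of the reaction networks above.

```python
import numpy as np

def em_and_milstein_gbm(x0=1.0, mu=0.05, sigma=0.3, dt=1e-3, n_steps=1000, seed=3):
    """One path of dX = mu*X dt + sigma*X dW with Euler-Maruyama and Milstein,
    driven by the same Wiener increments for comparison."""
    rng = np.random.default_rng(seed)
    x_em, x_mil = x0, x0
    for _ in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt))
        x_em += mu * x_em * dt + sigma * x_em * dW
        # Milstein adds the 0.5*b*b' * ((dW)^2 - dt) correction; here b = sigma*x, b' = sigma.
        x_mil += (mu * x_mil * dt + sigma * x_mil * dW
                  + 0.5 * sigma**2 * x_mil * (dW**2 - dt))
    return x_em, x_mil

print(em_and_milstein_gbm())
```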
Adaptive Methods
Our adaptive time step scheme is implemented through a three-step process: first, we evaluate criteria to measure the local error of the solution; second, we adjust the time step by either halving or doubling it based on these criteria; and third, we determine the Wiener increments for the adjusted time step We limit our adjustments to halving or doubling the time step for two main reasons: it transforms the Brownian bridge into a Brownian binary tree, facilitating efficient storage and retrieval, and it simplifies the calculation of the necessary 2D stochastic integrals in the chemical Langevin equation, which involves multiple, non-commutative, multiplicative noise sources.
We use an established set of criteria to measure the local error in both the drift and diffusion components of the solution. The drift local error is estimated from the difference between the Euler and Heun methods,

E_d(X^k, \Delta t) = \frac{\Delta t}{2} \left| a\!\left(X^k + \Delta t\, a(X^k)\right) - a(X^k) \right|   (2.188)

which is of order O(\Delta t^2). To estimate the diffusion local error, an Itô-Taylor expansion of the Milstein scheme is performed and the most computationally efficient O(\Delta t^{3/2}) term is retained, giving the corresponding diffusion criterion, Eq. (2.189).
The max norm is defined as the maximum absolute sum along the j dimension A critical requirement is that the number of molecules for all species involved in fast or continuous reactions must exceed a threshold of 20 molecules This ensures that these reactions can be effectively modeled as a continuous Markov process during numerical integration We assess Eqs (2.188) and (2.189) alongside the third criterion for each species in the system, considering the local error to be acceptable only when these equations fall below a specified tolerance value and the third criterion is satisfied.
The time increments in the simulation are organized in a binary tree whose top row holds the initial time step. Each subsequent row halves the time step of its parent nodes, and the number of nodes in row R is 2^R (R = 0, 1, ...). The current time step is always \Delta t_0 / 2^R, where \Delta t_0 is the initial time step. If the local error is too large, the time step is halved and we move down a row; if the local error is small, we may double the time step and move up a row, provided the current node number is divisible by two. If it is not divisible, we retain the current time step. As the numerical integration proceeds, the node number is incremented accordingly.
The primary purpose of utilizing the described binary tree is to efficiently compute Brownian bridges Initially, the Wiener increments for the first time step are calculated at the top row of the tree As the time step decreases, we avoid re-evaluating a Wiener increment for the halved time increment since the Wiener process value at the final time has already been determined Instead, we compute the Wiener increment for the halved time step, conditioned on the starting and ending values of the Wiener process Intermediate Wiener increments are generated using specific equations from section 2.4.9, ensuring that previously generated increments are reused and new ones are conditioned on them Furthermore, realizations of the two-dimensional Itô integrals for any given time increment are never duplicated The Wiener increments, current time step, and two-dimensional Itô integrals are then fed into either the Euler-Maruyama or Milstein numerical integrators to obtain a trajectory of the system’s dynamics for the subsequent time step.
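The key operation when a time step is halved is to sample the Wiener process at the midpoint of an interval conditioned on its already-realized endpoint values, so that previously generated increments are reused rather than discarded. The sketch below shows the standard Brownian-bridge midpoint rule; the binary-tree bookkeeping and the conditioned two-dimensional Itô integrals are not reproduced here.

```python
import numpy as np

def brownian_bridge_midpoint(t0, w0, t1, w1, rng=np.random.default_rng(4)):
    """Sample W((t0+t1)/2) given W(t0) = w0 and W(t1) = w1.
    For a standard Wiener process the midpoint is Gaussian with
    mean (w0 + w1)/2 and variance (t1 - t0)/4."""
    mean = 0.5 * (w0 + w1)
    std = np.sqrt(0.25 * (t1 - t0))
    return rng.normal(mean, std)

# Example: refine one coarse increment of size dt into two conditioned halves.
dt, rng = 0.1, np.random.default_rng(5)
w_end = rng.normal(0.0, np.sqrt(dt))                    # W(dt) from the coarse step
w_mid = brownian_bridge_midpoint(0.0, 0.0, dt, w_end, rng)
dW1, dW2 = w_mid - 0.0, w_end - w_mid                   # the two halved increments
```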
The Graphical User Interface
Figure 2.32: The main window of the graphical user interface
The graphical user interface facilitates the rapid creation of biochemical networks, enabling users to set parameters, model data, and generate input NetCDF files using the open-source MexCDF project within Matlab The interface features a main window and several auxiliary windows, which display essential model data, including chemical reaction systems, initial species conditions, volume, time settings, and trial counts Users can incorporate discrete events, such as cell replication, gamma-distributed reactions, and timed perturbations affecting molecule counts or reaction kinetics Additionally, the interface allows for simple sensitivity analyses by enabling the creation of multiple models within a single NetCDF file, each with unique initial conditions or kinetic parameters.
Adding Reactions and Setting Initial Conditions
Users input the stoichiometry of the reactants and products, along with the rate law and kinetic parameters of each reaction, as illustrated in Fig. 2.33. Currently, eleven distinct rate laws are available.
Figure 2.33: The two auxiliary windows used to add reactions and to set the initial conditions of the chemical species.

In addition to standard mass-action rate laws, we include several other useful rate laws, such as generalized power-law kinetics.
The software enhances the efficiency of adding reactions by assuming mass action kinetics and automatically populating species information in the GUI Users can input kinetic parameters or choose from various rate laws, offering flexibility in reaction addition Initial conditions for chemical species are specified in molecules, with an option to 'Split On Division' (SOD) to distribute selected species among daughter cells during replication Additionally, the software enables users to manage solution data by discarding or saving information for specific species, streamlining the modeling process.
Biological systems often display complex behaviors that cannot be easily modeled by simple biochemical reactions To enhance the simulation of these behaviors, we incorporate various special events that represent specific biological processes Notably, cell replication is more accurately modeled as a discrete event that distributes soluble molecules to daughter cells, rather than being approximated as a continuous dilution rate The timing of cell division is variable, typically following a Gaussian distribution around a mean Users can activate cell division special events in the main GUI window and input the mean and standard deviation of replication times Additionally, the cell volume increases exponentially from its initial value at a rate inversely proportional to the mean replication time, resetting to the initial volume upon cell division.
Transcriptional and translational elongation, often overlooked, plays a crucial role in the qualitative dynamics of biological systems by introducing delays in mRNA and protein production and by increasing stochasticity. The movement of an RNA polymerase or ribosome can be modeled as a series of N first-order reactions, where N is the number of base pairs or codons. However, because N is typically large, which hinders simulation efficiency, it is practical to assume a constant elongation rate, k, and to represent the entire elongation process as a single gamma-distributed event. By modeling transcriptional and translational elongation as gamma-distributed events with rate k and N steps, one obtains a balance between accuracy and computational efficiency.
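Sampling such an event is straightforward: replacing N identical first-order steps of rate k with a single delay means drawing the completion time from a Gamma distribution with shape N and scale 1/k, i.e., the sum of N exponential waiting times. A sketch, with an arbitrary example gene length and elongation rate:

```python
import numpy as np

def elongation_delay(n_steps, k_per_step, size=1, rng=np.random.default_rng(6)):
    """Sample gamma-distributed elongation times: the sum of n_steps
    exponential steps, each with rate k_per_step [1/sec]."""
    return rng.gamma(shape=n_steps, scale=1.0 / k_per_step, size=size)

# e.g. transcribing a 1000-nt gene at 50 nt/sec on average:
delays = elongation_delay(n_steps=1000, k_per_step=50.0, size=5)
print(delays)   # mean ~ 20 sec, with stochastic spread
```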
Modifying the kinetic parameters or the number of molecules of a chemical species during a simulation allows researchers to evaluate the system's response to external perturbations or to simulate complex phenomena For instance, one can introduce an inducer by either increasing its molecular count or its influx rate, or enhance the influx of a receptor-binding ligand to analyze the signal transduction network's response characteristics By implementing multiple perturbations, the system can effectively model intricate external behaviors, with these adjustments accessible through an auxiliary window linked to the main interface.
Specifying Multi-Model NetCDF Files and Simulations
Scientists and engineers studying natural biological systems or designing synthetic ones often need to vary kinetic parameters or initial conditions to observe their effects on system dynamics Instead of creating multiple separate NetCDF files for each model, Hy3S enables the creation of a multi-model NetCDF file that can include various biological models with different kinetic parameters and initial conditions When a simulation program processes this multi-model NetCDF file, it simulates the stochastic dynamics of each model across specified independent trials, storing the resulting four-dimensional solution data (Number of Models x Number of Trials x Number of Time-points x Number of Saved Species) back into the file This data can be easily analyzed using programs like Matlab, allowing for straightforward sensitivity analysis by varying parameters and examining the resulting dynamics.
In order to specify a multi-model NetCDF file, the user selects an Experiment Type of 2, 'Combinatorial variation of kinetic parameters and initial conditions', and adds variations in the Initial Condition & Kinetic Parameter Variation auxiliary window.
The auxiliary window allows users to add systematic variations of kinetic parameters or initial conditions, specifying a range of values and choosing between linear or logarithmic steps. When multiple variations are added, the GUI computes all combinations of the selected parameters and stores the corresponding data in the NetCDF file. Users can also create a customized list of kinetic parameters and initial conditions, making it possible to vary specific parameters while maintaining constraints on others, such as adjusting a backward kinetic constant while keeping the equilibrium constant fixed. This capability to generate multi-model NetCDF files combines many models into a single, compact file, streamlining data analysis.
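The combinatorial expansion itself is easy to express: each variation is a set of values and the models are the Cartesian product of those sets. The sketch below builds such a model list in memory with hypothetical parameter names; writing the result into a multi-model NetCDF file, as the GUI does, is omitted.

```python
import itertools
import numpy as np

# Each variation: a parameter name and the values it sweeps (linear or log steps).
variations = {
    "k1": np.linspace(0.5, 2.0, 4),          # linear steps
    "A_initial": np.logspace(1, 3, 3),       # logarithmic steps
}

names = list(variations)
models = [dict(zip(names, combo))
          for combo in itertools.product(*(variations[n] for n in names))]

print(len(models), "models")    # 4 x 3 = 12 parameter combinations
print(models[0])
```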
Creating Complex Biochemical Networks and Analyzing Data with Scripts
Results and Examples
This section presents three examples that highlight the unique features of Hy3S. The first example evaluates the accuracy of the HyJCMSS numerical method; the second serves as a benchmark for large-scale systems, testing Hy3S's capability to simulate biochemical networks with up to twenty thousand reactions; and the third showcases Hy3S's ability to simulate a complex, realistic bistable multiscale biochemical network that exhibits spontaneous escape. All reported computational times are from simulations on an Itanium2 1.5 GHz processor.
An Extensive Test of Accuracy
When we first proposed the numerical method for simulating a hybrid jump/continuous Markov process [2], we analyzed the method's accuracy with numerous examples, including a simple reaction network.
Table 2.17: The Non-Linear Cycle Test, listing its fast/continuous reactions and its slow reactions together with their kinetic constants.
The study explores a network of molecules characterized by a linear three-cycle of fast/continuous reactions and two non-linear slow reactions, referred to as the "Cycle Test." To assess the method's accuracy beyond linear fast/continuous reactions, a 'Non-linear Cycle Test' is implemented, which includes three fast/continuous and two slow reactions This research involves a series of accuracy measurements to analyze how the error in the probability distribution, mean, and variance of the solution varies with system size System size, defined by the number of initial reactant and product molecules in the fast/continuous reactions, is crucial for determining the validity of approximating these reactions as a continuous Markov process.
The hybrid jump/continuous Markov stochastic simulator has two primary sources of error: the approximation of the fast/continuous reactions as a continuous Markov process and the numerical integration of the chemical Langevin and differential Jump equations. As the system size increases, we expect the first error component to shrink, while decreasing the time step of numerical integration reduces the second. The magnitude of this second error scales as Δt^γ, where γ is the strong or weak order of accuracy of the stochastic numerical integration method, depending on the error definition used.
By varying the time step of numerical integration, we can measure the contribution of the second error source.
To effectively compare the HyJCMSS method with the Euler-Maruyama and Milstein schemes, it's essential to assess the strong error by analyzing the trajectory differences between the hybrid approximate and exact solutions This involves fixing the Brownian paths and evaluating the solution through various numerical schemes or, if possible, an exact analytical solution However, due to the complexities of simulating a coupled jump/continuous Markov process, fixing the random process to accurately evaluate the strong error poses challenges Therefore, we will focus on calculating the weak mean and variance errors, normalizing these results against the exact mean and variance for more straightforward comparisons.
Both the hybrid approximate and the exact means and variances are computed by running at least 10,000 independent trajectories. Both the Euler-Maruyama and Milstein numerical schemes exhibit first-order accuracy when evaluated using the weak definition of error.
We calculate the probability distributions for both exact and hybrid approximate solutions, along with the average normalized weak mean and variance errors, averaging these metrics across all time points and species.
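As a concrete illustration of how such normalized weak errors can be computed from two ensembles of trajectories, the following sketch averages the normalized mean and variance errors over all time points and species; it is not the Hy3S implementation, and the array shapes and synthetic test data are assumptions.

```python
# Normalized weak mean and variance errors between a hybrid approximate
# ensemble and a reference ("exact") ensemble, averaged over all time
# points and species.  Arrays have shape (trials, time points, species).
import numpy as np

def weak_errors(approx, exact, eps=1e-12):
    mu_a, mu_e = approx.mean(axis=0), exact.mean(axis=0)
    var_a, var_e = approx.var(axis=0, ddof=1), exact.var(axis=0, ddof=1)
    mean_err = np.abs(mu_a - mu_e) / (np.abs(mu_e) + eps)
    var_err = np.abs(var_a - var_e) / (np.abs(var_e) + eps)
    return mean_err.mean(), var_err.mean()

rng = np.random.default_rng(0)
exact = rng.poisson(50.0, size=(10_000, 20, 2)).astype(float)
approx = exact + rng.normal(0.0, 0.5, size=exact.shape)
print(weak_errors(approx, exact))
```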
In our simulations, we use the default HyJCMSS parameters (ε = 100, λ = 10, MSR_tol = 0.01), where MSR_tol is the maximum tolerance of the Multiple Slow Reaction approximation. We begin with the Euler-Maruyama scheme using a fixed time step of 0.01 seconds and progressively increase the system size of the Non-Linear Cycle Test from 100 to 10,000.
As the system size reaches 10,000, the ratio of the standard deviation to the mean of the solution approaches zero, indicating convergence towards the thermodynamic limit The HyJCMSS method effectively captures the probability distribution of solutions across all system sizes and species types, regardless of their response to fast or continuous reactions At a system size of 100, the method treats some fast/continuous reactions as slow due to the limited number of reactant or product molecules However, as the system size increases to 200 and beyond, all fast/continuous reactions are accurately classified Notably, there is no significant difference in the solutions between system sizes of 100 and 200 The HyJCMSS method dynamically classifies reactions as fast/continuous, approximating them as a continuous Markov process only when it ensures accuracy In the absence of fast/continuous reactions, it seamlessly transitions to the Next Reaction variant of the stochastic simulation algorithm, making it a versatile and effective alternative to traditional stochastic simulation methods.
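The default parameters above suggest a classification test of roughly the following form. This is only a sketch under the assumption that ε bounds the minimum reactant and product populations and λ bounds the minimum expected number of reaction events per time step; the exact Hy3S criteria are defined in the method's original description, and the Multiple Slow Reaction tolerance is not modeled here.

```python
# Sketch of a dynamic fast/slow classification test for one reaction,
# assuming epsilon is a population threshold and lam is an events-per-step
# threshold, as the default parameters (epsilon=100, lam=10) suggest.
def is_fast_continuous(propensity, species_counts, dt, epsilon=100, lam=10):
    """Treat a reaction as fast/continuous only if it fires many times per
    time step AND all participating species are populous enough for a
    continuous (chemical Langevin) approximation to be accurate."""
    many_events = propensity * dt >= lam
    large_populations = all(x >= epsilon for x in species_counts)
    return many_events and large_populations

# Example: a reaction firing ~2000 times per unit time with abundant species
print(is_fast_continuous(propensity=2000.0, species_counts=[500, 800], dt=0.01))
```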
As the system size increases, the stiffness of the chemical Langevin equation also rises, indicating a disparity in timescales within the differential equations Figure 2.36 illustrates the normalized weak mean and variance errors as system sizes grow Maintaining a constant time step leads to an increase in numerical integration errors of the SDEs, particularly in weak variance, while the mean error slightly decreases This suggests that stiffness is primarily influenced by the Wiener process terms, like the diffusion term, rather than the drift terms To achieve accurate solutions for both variance and mean, it is essential to adjust the time step to accommodate stiffness in either drift-dominated or diffusion-dominated scenarios.
By decreasing the time step of numerical integration, we reduce the weak variance error with little change in the weak mean error (Fig 2.37).
Automatically and dynamically determining a time step that minimizes numerical error is essential for achieving accurate weak mean and variance results.
The probability distributions for species E (top) and H (bottom) in the Non-Linear Cycle Test are illustrated at a time of 10 seconds across system sizes from 100 to 10,000 The integration of the chemical Langevin and differential Jump equations is performed using a fixed Euler-Maruyama scheme with a time step of 0.01 seconds, while all other parameters remain at their default settings.
Figure 2.36 shows the weak mean and variance errors of the Non-Linear Cycle Test: the average normalized weak mean error (top) and weak variance error (bottom), using the fixed time step Euler-Maruyama scheme with a time step of 10^-2 seconds and system sizes of (red) 100, (green) 200, (blue) 316, (magenta) 1000, (cyan) 3160, and (black) 10,000. All other parameters are set to default values.
The impact of the integrator time step, Δt_SDE, on the weak mean and variance errors is shown in Figure 2.37, which plots the average normalized weak mean and variance errors of the Non-Linear Cycle Test with a system size of 10,000, using the fixed time step Euler-Maruyama scheme over a range of Δt_SDE values. To evaluate the effectiveness of an adaptive time stepping method, we also analyze how the user-defined tolerance influences solution accuracy, varying the tolerance over several orders of magnitude and presenting the resulting weak mean and variance errors in Figure 2.38.

While adaptive time stepping relieves the user from manually choosing an accurate time step, it incurs more overhead and computational time per step than a fixed-step method. Nonetheless, for systems with transient or intermittent stiffness, adaptive schemes can improve overall computational efficiency by adjusting the time step to the current stiffness of the system.

The HyJCMSS method is notable for converting the solution of a hybrid system, governed by coupled Master and Fokker-Planck equations, into the solution of stochastic differential equations (SDEs). The numerical solution theory for SDEs is well developed, allowing the asymptotic convergence properties, accuracy, and numerical efficiency of different numerical schemes for Wiener process-driven SDEs, including the chemical Langevin and differential Jump equations, to be assessed. By testing the method's accuracy on a fully non-linear example, we show that its accuracy is primarily determined by the numerical integration of the SDEs, which can be tuned with a few user-defined parameters, such as the numerical integration time step and the maximum local error tolerance.
Figure 2.38 illustrates the impact of the user-defined tolerance, SDE_tol, on the weak mean and variance errors. It shows the average normalized weak mean (blue) and variance (red) errors of the Non-Linear Cycle Test with a system size of 10,000, obtained with the adaptive time step Milstein scheme and SDE_tol varied over several orders of magnitude, with all other parameters at their default settings.
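To make the accuracy/overhead trade-off concrete, the sketch below wraps a generic step-doubling error control around the Euler-Maruyama scheme for a scalar SDE. It illustrates the idea of an adaptive time step only; it is not the adaptive Milstein integrator used in Hy3S, and all parameter values are arbitrary.

```python
# Generic adaptive step-doubling for the Euler-Maruyama scheme on a scalar
# SDE dX = a(X) dt + b(X) dW.  One full step is compared against two half
# steps that share the same Brownian increments; the step is shrunk when
# the discrepancy exceeds a user-defined tolerance.
import numpy as np

def em_step(x, dt, dw, a, b):
    return x + a(x) * dt + b(x) * dw

def adaptive_em(x0, t_end, a, b, tol=1e-3, dt0=1e-2, rng=None):
    rng = rng or np.random.default_rng()
    t, x, dt = 0.0, x0, dt0
    while t < t_end:
        dt = min(dt, t_end - t)
        dw1 = rng.normal(0.0, np.sqrt(dt / 2))
        dw2 = rng.normal(0.0, np.sqrt(dt / 2))
        coarse = em_step(x, dt, dw1 + dw2, a, b)
        fine = em_step(em_step(x, dt / 2, dw1, a, b), dt / 2, dw2, a, b)
        err = abs(fine - coarse)
        if err <= tol * (1.0 + abs(x)):
            t, x = t + dt, fine
            dt *= 1.5          # accept and cautiously grow the step
        else:
            dt *= 0.5          # reject and retry with a smaller step
    return x

# Geometric Brownian motion: drift 0.5*X, diffusion 0.2*X
print(adaptive_em(100.0, 1.0, a=lambda x: 0.5 * x, b=lambda x: 0.2 * x))
```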
The Non-Linear Cycle Test is a toy system used to validate the accuracy of the HyJCMSS method, but real-world systems often involve thousands of reactions and chemical species. To assess the computational efficiency of HyJCMSS in simulating extensive biochemical networks, we use a large-scale system benchmark consisting of R_f bi-molecular second-order fast/continuous reactions coupled with R_s bi-molecular second-order slow reactions, giving a total of R_f + R_s reactions and 3R_f + 2R_s chemical species. The minimum degree of the dependency graph of the reaction propensities is always greater than R_s/R_f + 1, making this network less sparse than most biochemical networks. We increase R_f and R_s to scale the benchmark up to twenty thousand reactions.
Introduction
An Overview of the Chapter
This article provides an overview of regulated gene expression, highlighting the key interactions between gene expression machinery and DNA/RNA binding sites during transcriptional and translational processes It examines how DNA and RNA sequences influence the basal rates of gene expression, focusing on sequence determinants that affect the kinetics of critical steps Additionally, the article explores the role of regulatory molecules, such as transcription factors and microRNAs, in modulating gene expression rates Finally, it addresses the functions of the RNA Degradosome and proteasomes in degrading RNA and proteins, emphasizing how bacteria can specifically target these molecules for expedited degradation.
Physical and chemical modeling offers distinct advantages and limitations Experiments reveal outcomes based on specific configurations in reality, while simulations explore an infinite range of potential configurations However, not all configurations manifest in reality, and the sheer number of real configurations makes exhaustive experimentation impractical Consequently, a model lacking experimental validation may lead to inaccuracies, while experiments conducted without a guiding model can lack significance Therefore, a synergistic approach that incorporates both modeling and experimentation is essential for accurate scientific understanding.
In this section, we explore the modeling of regulated gene expression in bacteria through a system of chemical and biochemical reactions Our systematic approach converts key molecular interactions in transcription and translation—encompassing protein-DNA, protein-RNA, protein-protein, and RNA-RNA interactions—into a comprehensive system of chemical reactions We empirically measure the kinetics and thermodynamics of these interactions in the laboratory, summarizing available data While a full kinetic mathematical description is typically employed, we often find it more practical to assume that protein-DNA interactions at the promoter reach chemical equilibrium Thus, we demonstrate how to utilize the chemical partition function as a simplified mathematical representation of the mechanistic interactions involved in Holoenzyme and transcription factor assembly.
In our research, we focus on designing synthetic gene networks to achieve two distinct behaviors The first example is a protein device that activates gene expression only when two specific transcription factors are present, effectively mimicking a Boolean "AND" function This system offers advantages such as modularity, scalability, high fidelity, and a rapid response to inputs By linking multiple protein devices, we can program bacteria to respond to a defined set of inputs with a predetermined genetic output The second example involves a three-gene system that generates long-lived oscillations through repressor regulatory connections, with the promoter region's structure influencing the oscillation's period and amplitude.
3.2 An Overview of Regulated Bacterial Gene Expression
Gene expression is the process by which cellular organisms transform the information stored in their genomes into functional RNA and protein molecules These molecules play a crucial role in regulating biochemical reactions, metabolism, cell growth, signal transduction, cell motility, and differentiation By regulating gene expression, cells can adapt to their environment, modify their internal processes, replicate, and influence their surroundings Ultimately, the regulated expression of genes distinguishes living organisms from non-living chemical entities.
This article reviews the series of biochemical reactions that transcribe DNA into messenger RNA (mRNA) and translate mRNA into proteins, a process known as "the Central Dogma," coined by Francis Crick Focusing on bacterial gene expression, particularly in Escherichia coli, we highlight the well-understood biochemical mechanisms that govern this process Additionally, we explore how organisms can modify gene expression rates in response to intracellular signals, revealing multiple strategies to regulate RNA and protein production Understanding these mechanisms is essential for designing synthetic gene networks tailored for specific applications.
Transcription
Gene expression starts with transcription, an endothermic biochemical process where RNA polymerase, a protein composed of four subunits (α, β, β’, and ω), attaches to double-stranded DNA This enzyme converts the DNA base pair sequence into an RNA transcript made up of ribonucleotides that are complementary to the DNA's template strand Transcription occurs in three main stages: initiation, elongation, and termination.
To begin transcription, RNA polymerase must attach to the promoter recognition region of a gene However, RNA polymerase has a low affinity for DNA and can only slide along it in search of higher affinity regions To facilitate this process, a protein known as the sigma factor binds strongly to the promoter, enhancing the interaction with RNA polymerase and forming a larger complex called the Holoenzyme This initial binding of the Holoenzyme to the promoter DNA is often the rate-limiting step in transcription initiation The Holoenzyme's structure resembles a pair of pincers, featuring an entrance and exit for double-stranded DNA and an exit channel for single-stranded RNA.
After the Holoenzyme binds to promoter DNA, it undergoes a conformational change, allowing double-stranded DNA to enter the complex by unwinding into coding and template strands While the Watson-Crick hydrogen bonds between DNA base pairs are strong, RNA Polymerase exhibits even stronger protein-DNA interactions that facilitate this unwinding, often termed melting Although RNA also utilizes Watson-Crick base pairing, it is weaker due to the substitution of uridine for thymine, resulting in A:U pairing Once the template strand is inside, it pairs with free ribonucleic triphosphates (ATP, GTP, CTP, UTP) For RNA polymerase to move forward and initiate transcription, it must break existing protein-DNA interactions behind it and establish new ones ahead, a process known as promoter escape.
RNA polymerase catalyzes the polymerization of ribonucleotides in the 5' to 3' direction by utilizing complementary base pairing between DNA and RNA, along with the energy of the triphosphate bonds. Initially, the polymerization process is inefficient, releasing short RNA transcripts during a phase known as abortive initiation, after which RNA polymerase backtracks to the promoter site to restart initiation. Once the first ten nucleotides are synthesized, the σ-factor may detach from the Holoenzyme complex, triggering conformational changes that tighten the protein-DNA interactions and enhance the rate and efficiency of polymerization. This marks the transition to the transcriptional elongation phase.
During transcriptional elongation, RNA polymerase binds tightly to DNA, adding complementary ribonucleotides to the RNA transcript at a rate of 30 to 50 nucleotides per second, depending on the concentration of ribonucleoside triphosphates. This RNA polymerase:DNA:RNA complex, known as the transcriptional elongation complex, occasionally incorporates incorrect ribonucleotides through faulty base pairing, producing transcription errors. To correct these errors, the complex employs an inefficient editing mechanism consisting of a slow excision of incorrect nucleotides followed by backtracking. Because incorrect RNA:DNA base pairings hinder the movement of the nascent mRNA transcript through the exit channel, erroneous ribonucleotides are preferentially removed, although correct ones are sometimes excised as well. Overall, transcriptional elongation achieves an error rate of fewer than 1 error per 10,000 nucleotides, making it more error-prone than DNA replication, whose error rate is less than 1 in 10 million.
RNA polymerase continues transcribing DNA into an RNA transcript until it dissociates from the DNA, releasing its strong protein-DNA interactions and resulting in transcriptional termination [158]. Anything that prevents the RNA polymerase from continuing its polymerization reaction and forward translocation will cause this dissociation. There are two general classes of terminators:
Transcription termination in RNA polymerase can occur through two primary mechanisms: the presence of proteins or RNA secondary structures that hinder RNA polymerase's forward movement, and AT-rich RNA sequences that diminish the stability of DNA:RNA base pairing within the enzyme's catalytic site The most prevalent terminator involves an RNA sequence that incorporates both mechanisms, specifically an AU-rich mRNA sequence followed by a GC-rich symmetric sequence This AU-rich sequence reduces the RNA:DNA base pairing affinity, as A:U pairs are held together by two hydrogen bonds, compared to the three hydrogen bonds found in G:C pairs.
A GC-rich two-fold symmetric sequence can impede RNA polymerase translocation by forming a stable hairpin structure When RNA polymerase dissociates from the DNA, it releases the RNA transcript that has been synthesized.
Translation is an endothermic biochemical process that synthesizes a protein's primary amino acid sequence from an RNA transcript, specifically messenger RNA (mRNA), which accounts for less than 10% of a cell's total RNA The translation process consists of three key steps: initiation, elongation, and termination mRNA features specific sequence determinants in its 5’ UnTranslated Region (UTR) that facilitate ribosome binding and the start of translation In bacteria, multiple co-regulated proteins are often translated from a single mRNA transcript organized into an operon, which contains several start sites, allowing for the sequential production of different protein products.
Various RNA transcripts, unlike those with specific sequence determinants, perform essential functions by interacting with proteins, other RNAs, or small molecules These interactions are crucial for numerous processes, particularly in the regulation of translation, as seen with small hairpin RNAs (shRNAs) and microRNAs.
This section examines the critical roles of ribosomes, messenger RNA (mRNA), ribosomal RNA (rRNA), and transfer RNA (tRNA) in the translation process, highlighting their contributions to catalyzing reactions and sensing the intracellular environment.
The ribosome is a complex macromolecular structure composed of 55 different proteins and RNA molecules, organized into 50S and 30S subunits, along with 5S, 23S, and 16S ribosomal RNA (rRNA) The rRNA plays a crucial role in binding to messenger RNA (mRNA), initiating translation, and facilitating the polymerization of amino acids Additionally, transfer RNA (tRNA) translates the nucleotide sequence of mRNA into the corresponding amino acid sequence of proteins, relying on Watson-Crick interactions between the RNA molecules for this information transfer.
Translation initiation occurs when the 30S ribosomal subunit, which includes 16S rRNA and the initiator tRNA, attaches to the 5' UTR of an mRNA transcript Enhanced interactions that boost the affinity between the 30S subunit and the 5' UTR can accelerate translation initiation A key interaction is the Watson-Crick RNA:RNA base pairing between the 3' end of the 16S rRNA and the Shine-Dalgarno sequence in the 5' UTR, which is crucial for translation initiation in gram-positive bacteria In contrast, gram-negative bacteria rely on the ribosomal S1 protein within the 30S subunit, which interacts strongly with AU-rich sequences in the 5' UTR.
The ribonucleotide sequence AUG plays a crucial role in protein synthesis by binding to the initiator tRNA and marking the start site This interaction aligns the mRNA within the 30S ribosomal subunit, facilitating the binding of the larger 50S subunit and initiating translation Additionally, three translation initiation factors—IF1, IF2, and IF3—attach to the assembled ribosome to mediate essential conformational changes for the process to proceed.
Ribosomes can initiate multiple rounds of translation on the same mRNA transcript, allowing for the simultaneous synthesis of numerous proteins In bacteria, the absence of a nuclear membrane enables translation to commence immediately after the transcription of the 5' UTR, resulting in concurrent transcription and translation processes occurring in close proximity.
The binding of the larger 50S ribosomal subunit secures the mRNA transcript between the two ribosomal subunits, creating the ribosomal elongation complex This process of translational elongation is characterized by a series of efficient and endothermic chemical reactions, marked by repetitive conformational changes and Watson-Crick RNA:RNA base pairing, all orchestrated by the ribosome, mRNA, and tRNA molecules.
The Regulation of Translational Interactions
The production of proteins through mRNA translation is primarily regulated during the initiation and elongation phases The 5' UnTranslated Region (UTR) of an mRNA transcript plays a crucial role in determining the basal translation rate, similar to how promoter DNA influences transcription Key sequence determinants within the 5' UTR facilitate essential RNA-RNA and Protein-RNA interactions that govern ribosome assembly and translation initiation Additionally, the coding sequence of the mRNA can impact translation elongation, albeit to a lesser extent Furthermore, various proteins and RNA molecules, known as translation factors, can bind to the 5' UTR or coding regions to modulate translation rates effectively.
mRNA transcripts are generated as single-stranded RNA but can utilize RNA:RNA Watson-Crick base pairing to create RNA duplexes In contrast to double-stranded DNA, double-stranded RNA tolerates bulges, hairpins, and mismatched base pairs more easily This allows for robust RNA:RNA interactions to form in solution with as few as four ribonucleotides, leading to the development of large, stable secondary structures Notably, unlike protein structures, it is feasible to compute potential RNA secondary structures using various modern computational techniques.
[190] For our purposes, we use the successor to Mfold, called UNAfold [191], to calculate the minimum free energy (MFE) of an RNA secondary structure or RNA duplex.
The primary mechanism for ribosome binding to the 5’ end of mRNA and the initiation of translation involves Protein-RNA and RNA-RNA interactions between the mRNA's 5’ UTR and the 30S ribosome subunit The 30S subunit attaches to the mRNA at the ribosome binding site (RBS), which includes a U-rich sequence or the Shine-Dalgarno (SD) sequence (UAAGGAGG), followed by a minimum 5 bp spacer, the initiation codon AUG, and an additional 5-10 base pairs known as the downstream box The U-rich sequence interacts with the ribosomal S1 protein, while the SD sequence pairs with a complementary region in the 3’ end of the 16S rRNA, facilitating ribosome assembly at the translation start site Any deviations from the optimal SD or U-rich sequences or improper spacing can significantly decrease the rate of translation initiation.
Additional RNA:RNA interactions can inhibit translation initiation by sequestering U-rich or Shine-Dalgarno (SD) sequences A common mechanism involves the formation of mRNA secondary structures in the ribosome binding site, which obstruct the SD sequence from interacting with 16S rRNA These secondary structures typically include hairpins of four or more nucleotides, exhibiting Gibbs free energies ranging from -1 to -3 kcal/mol Table 3.4 illustrates nine distinct ribosome binding sites, concluding at the start codon, alongside the corresponding Gibbs free energies of their rRNA:mRNA interactions.
Table 3.4 presents a curated list of ribosome binding sites (RBSs), featuring DNA sequences that begin with AGGA and conclude with a start codon These sequences are ranked based on their translation efficiency, measured by the average number of proteins produced per mRNA transcript, with the highest efficiency rated as one Additionally, the table includes the Gibbs free energies associated with mRNA folding into secondary structures and the hybridization of rRNA:mRNA, calculated using UNAfold for comparative analysis.
RBS Sequence | Ranking | ΔG_folding [kcal/mol] | ΔG_hybridization [kcal/mol]
AGGACGGCCGG ATG | 9 | -3.1 | -11.2

An mRNA secondary structure dynamically folds and unfolds at equilibrium, and the likelihood of finding it in the folded state is determined by the Boltzmann factor of its folding free energy.
As the stability of mRNA secondary structures increases, they are more likely to exist in a folded state, which sequesters the Shine-Dalgarno (SD) sequence from binding to the 30S ribosomal subunit This results in the basal rate of translation initiation being influenced by the competition between mRNA secondary structure formation and the interactions between proteins or rRNA and mRNA However, the precise relationship between the rate of translational initiation and the ribosome binding site (RBS) sequence remains unclear, partly due to the ribosomal S1 protein's affinity for U-rich sequences and other less defined factors.
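Treating folding as a simple two-state equilibrium, the folded fraction implied by a tabulated ΔG_folding value can be evaluated as below. This is a minimal two-state sketch; the competition with 30S binding and the S1 protein interactions discussed above are not included.

```python
# Two-state estimate of the fraction of mRNA molecules whose ribosome
# binding site is sequestered in a folded secondary structure, using the
# Boltzmann factor of the folding free energy.
import math

R = 1.987e-3   # gas constant [kcal/(mol K)]
T = 310.0      # physiological temperature [K]

def folded_fraction(dG_folding_kcal):
    """P(folded) = exp(-dG/RT) / (1 + exp(-dG/RT)) for a two-state system."""
    boltzmann = math.exp(-dG_folding_kcal / (R * T))
    return boltzmann / (1.0 + boltzmann)

# The example RBS from Table 3.4 with dG_folding = -3.1 kcal/mol
print(folded_fraction(-3.1))   # ~0.99: the SD sequence is mostly sequestered
```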
Translation Factors: Proteins and RNAs
Both proteins and RNA molecules can bind to the ribosome binding site, sequestering Shine-Dalgarno (SD) or U-rich sequences, thereby hindering ribosome assembly Regulatory RNAs, such as short-hairpin RNAs and microRNAs, form strong RNA:RNA duplexes with complementary sequences in the ribosome binding site, effectively repressing translation Additionally, bacteria rapidly degrade long RNA:RNA duplexes, especially those from viruses with double-stranded RNA, leading to targeted degradation of the mRNA transcript and halting translation.
Short complementary RNA sequences can form duplexes that sequester target sequences without inducing degradation, allowing them to act as specific translational repressors for any ribosome binding site (RBS) sequence.
Codon Usage, Translational Pausing, and Frameshifting
After translation initiation, the ribosome's rate of elongation can be influenced by the mRNA coding sequence, particularly through its codon usage and the presence of extended repeats of uracil (U) or adenine (A) nucleotides.
The genetic code is considered degenerate due to the presence of 64 possible codons that correspond to only 20 natural amino acids, leading to redundant codons that incorporate the same amino acid into a polypeptide Evolution has resulted in higher concentrations of amino acyl-tRNAs for frequently used codons, while rare codons have lower concentrations This disparity can slow the translation elongation rate for rare codons, often causing translational pausing due to the limited availability of the necessary amino acyl-tRNA Thus, the frequency of codon usage is closely linked to the rate of translational elongation.
Analyzing the frequency of codons in an organism's genome allows for the construction of a codon table, which estimates the translational elongation rate for each codon These codon tables are publicly accessible on various websites, such as http://www.kazusa.or.jp/codon/, and vary between different organisms.
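A simple codon-usage table of the kind described above can be built by counting codons in a coding sequence, as sketched below; the short sequence used here is a hypothetical example, and mapping relative frequencies onto elongation rates would require a published codon table for the organism of interest.

```python
# Building a simple codon-usage table from a coding sequence; rare codons
# (low relative frequency) are candidates for slow elongation or pausing.
from collections import Counter

def codon_usage(cds):
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# Hypothetical short coding sequence for illustration
print(codon_usage("ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAA"))
```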
The frequent use of rare codons can lead to extended translational pauses and may cause ribosome subunit dissociation To enhance translation efficiency when transferring genes between organisms, it's typical to codon optimize the mRNA by substituting rare codons with more common ones However, for multi-domain proteins, translational pauses at domain boundaries are crucial for proper folding of individual domains Thus, strategically placed rare codons in the mRNA sequence can play a vital functional role.
Repeated sequences of the same nucleotide, accompanied by stable secondary structures like hairpins, can lead to ribosomal skipping of nucleotides, resulting in frameshifts These frameshifts may cause downstream stop codons to either appear or disappear Both bacteria and viruses exploit frameshifting to diversify the composition of multi-domain proteins, utilizing stop codons to conditionally or randomly incorporate additional protein domains.
Messenger RNA and Protein Degradation and Dilution
The steady-state concentrations of mRNAs and proteins in bacteria are influenced by their production rates through transcription and translation, as well as their degradation and dilution rates By modifying the degradation rates of mRNA transcripts or proteins, bacteria can swiftly adjust their concentrations without affecting transcription or translation The degradation rate is determined by sequence features and often relies on the formation of secondary or tertiary structures that interact with the degradation machinery By altering these sequences, bacteria enhance their control over mRNA and protein production Additionally, other RNA and protein molecules can regulate degradation rates by binding to targets, either blocking access to degradation machinery or promoting stronger interactions that lead to increased degradation During the exponential growth phase, however, cell division significantly impacts the steady-state concentrations of mRNAs and proteins, as bacteria replicate and distribute their cytoplasmic contents to daughter cells.
The dilution rate, set by the cell growth rate, is therefore a crucial factor in determining the steady-state concentrations of mRNAs and proteins. Many conditions, such as nutrient and oxygen availability, the presence of antibiotics or toxic chemicals, and whether the bacteria grow in solution or on an agar plate, significantly affect the cell growth rate. Because the dilution rate typically exceeds the degradation rate of most bacterial proteins, dilution is often their dominant removal process.
Cell replication is a complex process involving DNA replication, separation of the bacterial chromosome, and invagination of the plasma membrane to form two distinct daughter cells While the dilution rate serves as an approximation, a more precise understanding of cell replication characterizes it as a random discrete event.
RNAses and the RNA Degradosome
The degradation of mRNA transcripts relies on the coordinated actions of various endo and exonucleases, along with auxiliary proteins, which together form the RNA degradosome complex RNA endonucleases cleave the mRNA in the middle, while exonucleases target the tri- or mono-phosphate ends, degrading the transcript from 3’ to 5’ In bacteria, the identified RNAse exonucleases specifically bind to mono-phosphate or poly-adenylated 3’ ends These enzymes work in a "cut and chew" mechanism, where endonucleases generate new 3’ mono-phosphate ends, allowing exonucleases to sequentially degrade the transcript until they either exhaust the material or encounter obstacles from mRNA secondary structures.
The initial nucleolytic attack on the mRNA transcript is a crucial rate-limiting step, causing ribosomes to detach and resulting in unproductive translation However, unaffected coding regions with intact ribosome binding sites can still engage ribosomes, enabling productive translation This location-specific degradation of mRNA transcripts, which may contain multiple start sites and coding regions, leads to varying effective half-lives and differing rates of protein production Such variations allow for the balancing of protein concentrations within an operon, essential for the formation of multimeric complexes or for optimizing metabolic pathways.
The degradosome in Escherichia coli consists of RNAse E, PNPase, an auxiliary helicase, and enolase, and acts together with RNAse II, RNAse III, and PAP I in the regulated degradation of mRNA transcripts. RNAse E functions as an endonuclease, specifically targeting single-stranded RNA for cleavage.
RNAse E is more efficient when it binds near the 5' end of the mRNA, although its exact sequence and structural specificity remain uncertain. RNAse III, another endonuclease, recognizes specific double-stranded RNA structures known as proximal and distal boxes. Various mRNA secondary structures can hinder the binding of RNAse E and RNAse III, thereby decreasing their endonucleolytic activity. The exonucleases RNAse II and PNPase, together with the poly(A) polymerase PAP I, collaborate to polyadenylate and degrade mRNA transcripts in the 3' to 5' direction. The activity of these enzymes can be modulated by stable mRNA secondary structures, complementary regulatory RNA molecules, or the presence of elongating ribosomes. Notably, the 3' UTR regions of mRNAs often contain small secondary structures that protect them from rapid degradation, while ribosomes can temporarily occlude RNAse binding sites, particularly on efficiently translated transcripts.
Peptide Tags and the Proteosome
The degradation of cytoplasmic proteins involves a complex known as the proteosome, which consists of various protease and substrate-binding proteins In bacteria, there are four distinct families of proteosomes: ClpAP/XP, ClpYQ (HslUV), Lon, and FtsH, each comprising an ATPase and a proteolytic subunit These protease subunits create a large multimeric complex featuring multiple active proteolytic sites within a central chamber The ATPase domain plays a crucial role in binding to target proteins, unfolding them into a disordered form, and facilitating their translocation into the chamber through an ATP-dependent process, where they are subsequently degraded into small peptides.
Bacterial proteins with binding sites for the ATPase domain in proteosomes are targeted for degradation more quickly, often through short peptide sequences located near their C- or N-terminals A notable example is the ssrA tag, an 11 amino acid peptide that, when attached to a cytoplasmic protein, facilitates its rapid degradation by ClpXP, ClpAP, or FtsH proteosomes The presence of the ssrA tag can reduce a protein's half-life by approximately tenfold This mechanism allows bacteria to swiftly degrade partially translated proteins when ribosomes stall The ssrA RNA, which resembles tRNA, binds to these stalled ribosomes, replacing the RNA sequence with the ssrA tag, thus completing protein translation while appending the ssrA peptide at the C-terminal end.
The half-lives of proteins can also be regulated by adaptor proteins that enhance the interactions between targeted proteins and proteosomes. By dynamically producing these adaptor proteins in response to environmental or metabolic changes, the cell enables proteosomes to quickly bind and degrade specific proteins, lowering their steady-state concentrations. A notable example of this mechanism is the general stress response in E. coli, where the recognition factor RssB directs the preferential degradation of the sigma factor σ^S.
3.3 The Modeling of Gene Networks
The goal of modeling a system of regulated genes, called a gene network, is two-fold:
1. to accurately capture the dynamics of the regulated production of RNA and protein molecules; and
2. to connect experimental modifications of the molecular interactions in the system to changes in the kinetic constants of the model and, consequently, to the system-level behavior of the model.
Our objective is not to model every molecular interaction and protein conformational change but to make simplifying assumptions when necessary These assumptions are not made for the sake of simplifying mathematical equations; rather, they are employed when modern techniques cannot experimentally distinguish between different interaction series, particularly during rapid and coordinated conformational changes of large proteins As a result, the outcomes of these models are directly applicable and comparable to experimental findings, which is our primary aim.
We analyze gene networks by deconstructing their molecular interactions—specifically protein-DNA, protein-RNA, protein-protein, and RNA-RNA—into a framework of chemical and biochemical reactions governed by mass action rate laws The kinetic parameters and thermodynamic free energies used in our model are primarily derived from empirical measurements found in existing literature Additionally, we consider the formation and dissolution of stable non-covalent bonds, such as the protein-DNA complex, as standard biochemical reactions.
Our model functions like an algorithm, enabling the representation of a sequence of DNA with specific genetic components through a defined set of rules, akin to programming This approach remains consistent even in complex gene networks, as the model-generating algorithm is systematically applied to every molecular interaction within each gene, encompassing regulatory processes in transcription, translation, mRNA and protein degradation, as well as enzymatic reactions that often serve regulatory functions, such as phosphorylation and methylation As a result, we can outline the reaction system for a generalized gene influenced by transcription or translation factors, applying the same algorithm across all genes within the network.
The reaction system incorporates unique species for each DNA operator, promoter, ribosome binding site, and other specific regions in DNA or RNA, enabling precise documentation of Protein-DNA, Protein-RNA, and RNA-RNA regulatory interactions This approach eliminates the need for Hill kinetics and arbitrary rate laws, ensuring clarity in molecular interactions Notably, the interactions between transcription or translation factors and their binding sites are context-free, meaning that relocating a binding site does not affect the kinetics of the interaction Additionally, cooperative binding among transcription or translation factors is explicitly represented through attractive interactions between neighboring binding factors.
Kinetics and Equilibrium Data
Before analyzing the model's reactions, it's essential to understand the interplay between kinetic, equilibrium, and thermodynamic data, along with their empirical measurement and necessary approximations Each reaction in the model is associated with a kinetic rate that indicates how quickly reactant molecules associate and how covalent bonds or stable non-covalent interactions are formed or broken In biological systems, the complex interactions between reactant molecules require empirical measurements of their kinetics Therefore, it is crucial to include only those reactions in the model for which kinetic or equilibrium data can be reliably obtained.
For a bi-molecular reaction with forward and backward kinetic constants, k_f and k_b, the equilibrium dissociation constant is the ratio K_d = k_b / k_f, and the equilibrium association constant is its inverse, K_a = 1 / K_d. These equilibrium constants are related to the Gibbs free energy of binding or reaction, ΔG, between the two molecules via

K_d = exp(ΔG / RT),   (3.1)

where R is the gas constant and T is the absolute temperature.
Experimental techniques such as surface plasmon resonance, electromobility gel shift assays, and fluorescence-based tagging enable the empirical measurement of kinetic and equilibrium constants for protein binding to DNA or RNA This includes the formation of holoenzymes on promoter DNA and the interactions of transcription factors with their operator sites Since the equilibrium constant and Gibbs free energies rely solely on the relative concentrations of bound and unbound states of the nucleic acids, equilibrium data is more accessible and frequently reported compared to kinetic data.
In cases where only equilibrium data are available, it is reasonable to assume that the forward rate of a large protein binding to its DNA or RNA site is diffusion limited. Using the protein's size, one can estimate its forward binding kinetics and then determine the backward kinetics from the equilibrium data. This approach uses the elementary description of two particles diffusing in three-dimensional space, where the Smoluchowski rate of a diffusion-limited reaction is

k_f = 4πDa,

with D the diffusion coefficient and a the size of the target site. The diffusion coefficient of a free particle of diameter d in a homogeneous fluid is given by Einstein's relation,

D = k_B T / (3πηd).

Substituting Einstein's relation into the Smoluchowski rate, we obtain the kinetic rate of association in terms of the temperature, T, the viscosity of the fluid, η, and the ratio between the size of the target site and the diameter of the protein, a/d:

k_f = (4 k_B T / 3η) (a/d).

Assuming an aqueous fluid at physiological temperature (T = 30°C) and a/d ≈ 1 yields an association rate of k_f = 3.3590 × 10^9 [M s]^-1. Because the protein is typically much larger than its DNA binding site, so that a/d < 1, we approximate the association constant as k_f ≈ 10^8 [M s]^-1. When kinetic data are unavailable but equilibrium data exist, the backward kinetic constant can then be estimated as k_b = 10^8 × K_d.
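The short sketch below works through these estimates numerically: it converts a binding free energy into K_d via Eq. (3.1), evaluates the Smoluchowski/Einstein forward rate, and estimates the backward rate as k_b = k_f × K_d. The -12 kcal/mol operator free energy is an arbitrary example, and electrostatics and one-dimensional sliding are ignored, as discussed below.

```python
# Estimating protein-DNA binding kinetics when only equilibrium data exist.
import math

k_B = 1.381e-23        # Boltzmann constant [J/K]
N_A = 6.022e23         # Avogadro's number [1/mol]
R = 1.987e-3           # gas constant [kcal/(mol K)]

def diffusion_limited_kf(T=303.15, eta=1.0e-3, a_over_d=1.0):
    """k_f = (4 k_B T / 3 eta) * (a/d), converted to [1/(M s)]."""
    kf_m3_per_s = (4.0 * k_B * T / (3.0 * eta)) * a_over_d
    return kf_m3_per_s * 1e3 * N_A     # m^3/s -> L/s, then per mole

def backward_rate(dG_kcal_per_mol, kf=1e8, T=303.15):
    """K_d = exp(dG/RT); k_b = k_f * K_d."""
    Kd = math.exp(dG_kcal_per_mol / (R * T))
    return kf * Kd

print(f"{diffusion_limited_kf():.3e}")   # ~3.36e9 [1/(M s)] for a/d = 1
print(f"{backward_rate(-12.0):.3e}")     # k_b for a -12 kcal/mol operator site
```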
These approximations neglect electrostatic interactions between proteins and their target sites, as well as the one-dimensional sliding of proteins along DNA or RNA, both of which can increase the diffusion-limited association rate; conversely, if a protein must bind its target in a particular conformation, the association rate may decrease. We now examine the reaction system models for the regulated processes in gene expression.
Regulated Transcription
The production of mRNA transcripts involves three key stages: transcriptional initiation, elongation, and termination, along with regulatory interactions that influence each step's rate In E coli, the basal transcription process is represented by five chemical reactions, utilizing four free kinetic parameters that regulate Holoenzyme formation and the transition from closed to open conformations, while other parameters are determined experimentally The complexity of transcription factor binding interactions is determined by the number of transcription factors and DNA operators present We propose a generalized reaction scheme applicable to any number of activating or repressing transcription factors.
The formation of the Holoenzyme complex on the promoter involves reversible reactions, and the binding order of the σ-factor and RNA polymerase may vary: the σ-factor could bind the promoter first and then recruit RNA polymerase, or it could bind RNA polymerase before the complex binds the promoter DNA. Both routes can occur simultaneously, so the specific order of these binding events is not critical. We therefore define a single complex, RNAP·σ, which reversibly binds to and unbinds from the promoter DNA.
The binding of RNAP·σ to the promoter forms the RNAP:P complex, governed by the forward and backward kinetic constants k^f_RNAP and k^b_RNAP. These kinetic constants depend on the specific promoter DNA sequence and on the availability of the active form of the σ-factor.
Adjacent operators near or within a promoter should be included in the reaction pairs, as they are part of the same contiguous DNA molecule encompassing the promoter and coding sequences By treating these operators as unique species, we can clearly outline the regulatory interactions that happen when each operator is bound by a transcription factor This approach highlights the complexity of reactions involving multiple reactants.
Even though additional species appear as reactants, a bi-molecular reaction with a second-order rate law is retained, because the extra species do not diffuse independently in space. For instance, when two overlapping operators, O₁ and O₂, exist within the promoter region, the RNAP·σ complex can bind the promoter only when both operators are unbound, giving the pair of reactions
RNAP·σ + Promoter + O₁ + O₂ ⇌ RNAP:P:O₁:O₂, with a 2nd-order bi-molecular rate law and the kinetic constants defined above. The operators may instead bind repressor transcription factors whose steric hindrance prevents the RNA polymerase from binding.
Conversely, if an adjacent (typically upstream) operator, O₁, binds an activator transcription factor, TF_A, that has attractive interactions towards the RNA polymerase, the pair of reactions becomes
RNAP·σ + Promoter + TF_A:O₁ ⇌ RNAP:P:TF_A:O₁, with an increased k^f_RNAP or a decreased k^b_RNAP quantifying the attractive interactions that either recruit the RNAP·σ complex or stabilize its assembly on the promoter.
The assembly of the Holoenzyme complex on the promoter triggers a crucial conformational change from closed to open, facilitating DNA unwinding and promoter escape Due to the complexity of this process, which involves several individually improbable steps, we consider the conformational change to be a single first-order step.
RNAP:P → RNAP*:P, with a first-order kinetic constant describing the closed-to-open isomerization. Any DNA operators included in the Holoenzyme formation reaction must persist through the subsequent reactions; with two overlapping operators, for example, the conformational change is written as RNAP:P:O₁:O₂ → RNAP*:P:O₁:O₂.
Following the conformational change, the Holoenzyme complex must move beyond the promoter region through forward translocation During this process, the interactions between the sigma factor and RNA polymerase weaken, potentially leading to the complete dissociation of the sigma factor, although its release is not monitored The promoter escape phase is represented as a first-order reaction.
In the promoter escape reaction, the promoter species is regenerated so that another Holoenzyme may assemble at the site, and any operators that participated in the assembly are regenerated as well. With two overlapping operators, for example, the promoter escape step regenerates the Promoter, O₁, and O₂ while producing the elongating RNAP:DNA₁ complex.
Once RNA polymerase moves past the promoter region, it begins transcriptional elongation. During this phase, RNA polymerase advances along the DNA, synthesizing an mRNA transcript at a sequence-dependent rate, k_elong. This process can be represented as a series of N first-order reactions,
RNAP:DNA₁ → RNAP:DNA₂ → ⋯ → RNAP:DNA_N → RNAP + mRNA, where N is the number of nucleotides in the DNA coding sequence, typically between 200 and 1000 base pairs. Because N is large, it is impractical to simulate each translocation step explicitly; however, the delayed production of the mRNA transcript becomes significant for large N, so a more efficient model of transcriptional elongation must still account for this delay.
In stochastic chemical kinetics, reaction waiting times are exponentially distributed random variables, and the total time of a sequence of such events is gamma distributed. By assuming that the elongation rate is sequence independent, we can therefore replace the N first-order reactions with a single γ-distributed reaction, RNAP:DNA₁ → RNAP + mRNA, characterized by the kinetic rate k_elong and the number of steps N. The average delay time of this transcription reaction is N/k_elong and its variance is N/k_elong². The rate of transcriptional elongation also depends on the cell's growth rate and typically ranges from 30 to 70 nucleotides per second.
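The equivalence between N sequential exponential steps and a single gamma-distributed delay can be checked numerically, as in the sketch below; the values of N and k_elong are illustrative only.

```python
# Representing N sequence-independent elongation steps of rate k_elong as
# one gamma-distributed delay: mean N/k_elong and variance N/k_elong**2,
# rather than simulating N individual first-order translocation reactions.
import numpy as np

rng = np.random.default_rng(1)

N = 600          # nucleotides in the coding sequence
k_elong = 45.0   # nucleotides per second

# Exact: sum of N exponential waiting times; equivalent: one gamma sample
delay_sum = rng.exponential(1.0 / k_elong, size=N).sum()
delay_gamma = rng.gamma(N, 1.0 / k_elong)

print(N / k_elong, N / k_elong**2)     # theoretical mean and variance
print(delay_sum, delay_gamma)          # two equivalent ways to sample the delay
```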
The binding and unbinding of the i-th transcription factor, TF_i, to its operator DNA site, O_i, is represented by the pair of reactions TF_i + O_i ⇌ TF_i:O_i, with forward and backward kinetic constants k^f_TF,i and k^b_TF,i. Each additional operator site generates another such pair of reactions, and when two operator sites bind the same transcription factor, two distinct pairs of reactions are created.
When a repressor transcription factor binds to an operator that overlaps with the promoter, it sterically excludes RNA polymerase binding. This interaction is represented by requiring the free overlapping operator as a reactant in the initial assembly of the Holoenzyme.
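A single operator-binding pair of reactions of this kind can be simulated exactly with the stochastic simulation algorithm, as in the minimal Gillespie direct-method sketch below; the rate constants and copy numbers are illustrative, not measured values.

```python
# Minimal Gillespie (direct method) simulation of one transcription-factor /
# operator binding pair, TF + O <-> TF:O, with illustrative rate constants.
import numpy as np

rng = np.random.default_rng(2)

def ssa_tf_operator(tf=50, free_op=1, bound_op=0, kf=1e-2, kb=0.1, t_end=200.0):
    t, history = 0.0, []
    while t < t_end:
        a1 = kf * tf * free_op      # binding propensity
        a2 = kb * bound_op          # unbinding propensity
        a0 = a1 + a2
        if a0 == 0.0:
            break
        t += rng.exponential(1.0 / a0)         # time to the next reaction
        if rng.random() * a0 < a1:             # choose which reaction fires
            tf, free_op, bound_op = tf - 1, free_op - 1, bound_op + 1
        else:
            tf, free_op, bound_op = tf + 1, free_op + 1, bound_op - 1
        history.append((t, bound_op))
    return history

occupancy = ssa_tf_operator()
print(sum(b for _, b in occupancy) / len(occupancy))   # crude mean occupancy
```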
Regulated Translation
Like transcription, the regulated process of translation can be modeled by first describing the basal translation process and then adding the regulatory interactions that alter its rates.
The 30S and 50S ribosomal subunits initially attach to the ribosome binding site (RBS) on the mRNA transcript Once assembled, the ribosome initiates the elongation process by translocating forward, which subsequently liberates the RBS This process is represented by two irreversible reactions.
Ribosome + mRNA_RBS → Rib:mRNA_RBS, followed by Rib:mRNA_RBS → mRNA_RBS + Rib:mRNA₁, which frees the ribosome binding site as elongation begins. These reactions are governed by kinetic constants for translation initiation, k_init, and elongation, k_elong. Translational elongation could then be modeled as a series of N first-order reactions, but because the number of codons N in an mRNA transcript typically ranges from 67 to 667, it is more efficiently described as a single γ-distributed reaction.
Similar to transcriptional elongation, the process of translational elongation may therefore be modeled as a single γ-distributed reaction, Rib:mRNA₁ → Ribosome + Protein, with a sequence-independent kinetic constant, k_elong, and N steps.
Translation Factors and mRNA Secondary Structures
mRNA secondary structures that block access to the ribosome binding site reduce the translation initiation rate. We represent the formation of such a secondary structure as a reversible reaction, mRNA_RBS ⇌ mRNA*_RBS, where mRNA*_RBS denotes the folded form sequestering the RBS, with the change in Gibbs free energy computed using an RNA secondary structure calculator.
In addition to secondary structures, various non-coding RNAs (ncRNAs) and translation factor proteins (SLFs) can bind to the ribosome binding site and limit its accessibility. We model these interactions with the reversible reactions ncRNA + mRNA_RBS ⇌ ncRNA:mRNA_RBS and SLF + mRNA_RBS ⇌ SLF:mRNA_RBS, with suitable forward and backward kinetic constants.
mRNA and Protein Degradation and Dilution
In cellular processes, all molecules eventually undergo degradation or dilution due to cell growth This degradation is facilitated by enzymatic reactions, such as those involving the RNA degradosome or proteasome Additionally, molecular species become diluted during cell growth and division For clarity, we treat these two processes as independent Furthermore, we do not account for the changing production rates of RNA polymerase or ribosomes, nor do we include their degradation or dilution effects in our analysis.
We model the basal degradation of RNA molecules with first-order reactions, such as mRNA_RBS → ∅ and ncRNA → ∅, whose first-order kinetic constants are determined from measured half-lives. These reactions assume a constant concentration of the RNA Degradosome. The basal degradation rate depends on the RNA sequence: it increases when RNAse binding sites are present and when sequestering secondary structures are absent.
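The conversion from a measured half-life to the corresponding first-order rate constant is the usual k = ln 2 / t_half relation, as in the small sketch below (the two-minute half-life is an arbitrary example).

```python
# Converting a measured mRNA half-life into the first-order degradation
# constant used for the reaction mRNA -> 0.
import math

def degradation_rate(half_life_s):
    return math.log(2.0) / half_life_s

print(degradation_rate(120.0))   # k_deg for a 2-minute mRNA half-life [1/s]
```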
Similarly, the basal degradation rate of protein products, transcription factors, and translation factors can be modeled by using first order reactions, such as
Protein → ∅, TF → ∅, and SLF → ∅, where the first-order kinetic constants are again determined from measured half-lives. The basal degradation rate of a protein depends on the presence of peptide tags that bind to the proteosome, whose concentration is assumed constant in these reactions.
The presence of adaptor proteins often increases the degradation rate of RNA or protein molecules.
We can approximately model their effects by treating these adaptor proteins, called a degradation factor (DF), as an enzymatic catalyst for RNA or protein degradation, using the reactions
DF + RNA ⇌ DF:RNA → DF
DF + Protein ⇌ DF:Protein → DF
The degradation factor reversibly binds to the RNA or protein molecule and shuttles it to the RNA Degradosome or proteosome for destruction.
The dilution of RNA and protein species can be modeled using two approaches: one treats cell replication as a continuous process, while the other considers it a discrete event In the continuous model, the bacterial cell volume remains constant, and the dilution rate of each cytoplasmic RNA and protein species follows a first-order reaction, akin to degradation reactions.
The kinetic constants of these dilution reactions are all identical and equal to the cell replication rate constant, k_cr = log 2 / t_ACR, where t_ACR is the average cell replication time.
In the second approach, cell replication is modeled as a random discrete event, with cytoplasmic contents distributed to the daughter cells according to a Binomial distribution. The bacterial cell volume grows exponentially from an initial volume, V₀, according to V = V₀ exp(k_cr (t − t_r)), where t_r is the time of the last replication event. The time of the next cell replication is drawn from a Gaussian distribution whose mean is the average replication time and whose standard deviation is typically 5-10% of that average. At division, the quantities of RNA and protein molecules are either halved or sampled from a Binomial distribution, with the number of trials equal to the number of molecules and a probability of 1/2, and the cell volume is either halved or reset to V₀.
1. Step 1: Starting at the time of the last replication event, t_r, with volume V₀, select a Gaussian random variable t_d for the time of the next division, with mean t_r + t_ACR and the chosen standard deviation.
2. Step 2: Between t_r and t_d, the volume increases continuously according to V = V₀ exp(k_cr (t − t_r)).
3. Step 3: At t = t_d, the volume is either reset to V = V₀ or halved (V = V/2), and the quantity of each cytoplasmic chemical species, X_i, is drawn from a Binomial distribution, X_i ~ Binomial(X_i, 1/2). The time of the last cell replication is then reset to t_r = t_d, and the algorithm repeats from Step 1.
Using this model, the stochasticity arising from cell replication may also be included.
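The discrete-event scheme above can be sketched in a few lines of code, as below; the replication-time parameters and species counts are arbitrary examples, and here the volume is reset to V₀ at each division (the halving variant would work analogously).

```python
# Sketch of the discrete cell-replication event: Gaussian replication times,
# exponential volume growth, and Binomial partitioning of each cytoplasmic
# species to one daughter cell.
import numpy as np

rng = np.random.default_rng(3)

def simulate_divisions(species, v0=1.0, t_mean=2100.0, t_sd=150.0, n_div=3):
    kcr = np.log(2.0) / t_mean          # growth-rate constant
    t_last = 0.0
    for _ in range(n_div):
        t_div = t_last + rng.normal(t_mean, t_sd)          # Step 1
        volume = v0 * np.exp(kcr * (t_div - t_last))       # Step 2, just before division
        species = {name: rng.binomial(n, 0.5)              # Step 3: Binomial partition
                   for name, n in species.items()}
        volume, t_last = v0, t_div                         # reset volume, record event
    return species, t_last

print(simulate_divisions({"mRNA": 40, "Protein": 1200}))
```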
Introduction
The engineering of biological organisms for complex tasks is still developing, with promising applications in extreme genetic engineering This includes modifying bacteria to detect trace chemicals like TNT and employing gene therapy to address various human diseases, such as diabetes and cancer, by inserting corrective DNA that produces essential therapeutic proteins at the appropriate times.
Synthetic gene networks have been developed by reorganizing naturally occurring molecular components, including transcription factors, mRNA hairpins, and DNA operator and promoter sites, into innovative configurations These networks demonstrate various potentially beneficial dynamic or logical behaviors.
[33, 34] In addition, by creating entirely new genetic components, including polydactyl zinc fingers, chimeric activator or repressor fusion proteins, libraries of novel bacterial promoters with adjustable basal expression, and RNA molecules that form small molecule-binding secondary structures, we can expand the genetic toolbox. This advancement allows for the sensing of new molecular signals and the development of novel phenotypes.
Synthetic gene networks showcase a variety of innovative designs, including bistable switches, oscillators, switch-oscillators, cascades, and feedback loops Examples include transcriptionally and translationally regulated systems, as well as population-dependent activation in bacteria and yeast These prototypes of synthetic genetic programs are capable of producing increasingly complex behaviors As research progresses and more molecular components are understood, a key objective is to uncover additional gene and protein networks that demonstrate novel or enhanced functionalities.
However, it is still not well understood how the molecular interactions between DNA binding sites, RNAs, and proteins influence the dynamics of gene expression and the resulting phenotypes.
In order to understand how these interactions affect the overall dynamics, one may develop mathematical models that quantitatively describe all of the significant molecular interactions in the synthetic gene network and compare the results of the models, including Boolean networks, graph networks, and jump Markov processes, to experimental observations. The primary goals of this modeling are to identify the critical molecular interactions within the system, linking molecular events to observable behaviors, to analyze how small changes influence the system's dynamics, and to determine which modifications are necessary to achieve a specific dynamical outcome.
Computational biology encompasses a wide range of topics, with numerous applications in understanding genetic regulation Research has utilized the chemical partition function to analyze how various genetic components, such as activator and repressor transcription factors, operator placement, and DNA-looping, influence transcription regulation Additionally, mathematical analyses have been conducted on rationally designed gene networks, including those featuring mixed feedback loops.
Synthetic gene networks have been developed to investigate the virulence cycle of HIV and to mitigate the progression of AIDS These networks are also used to explore various biological phenomena, such as the impact of stochastic resonance and noise on cellular memory, calcium oscillations, neuron dynamics, enzyme futile cycles, and quorum sensing Furthermore, optimization techniques, including evolutionary computation and simulated annealing, can identify synthetic gene networks with specific desired behaviors This overview highlights key studies in the field but is not exhaustive.
In our research, we analyze regulated gene expression interactions through a framework of chemical and biochemical reactions, applying stochastic process theory to explore the system's stochastic dynamics This stochasticity arises from the limited quantities of molecules involved, such as regulatory DNA-binding proteins, and has been experimentally validated in gene expression studies.
Mathematical models are utilized to quantitatively predict the dynamics of synthetic gene networks in response to environmental and regulatory stimuli These models systematically identify the regulatory connections between genes that lead to specific dynamical behaviors, such as bistable switching and oscillations Additionally, they analyze how the kinetics of interactions among the components quantitatively influence the behavior of the synthetic gene network.
3.1.1 An Overview of the Chapter
This article provides an overview of regulated gene expression, emphasizing the critical interactions between gene expression machinery and DNA/RNA binding sites during transcriptional and translational processes, including initiation, elongation, and termination It explores how DNA and RNA sequences influence the basal rates of these processes, highlighting sequence determinants that affect the kinetics of rate-limiting steps in gene expression Additionally, the article examines the role of regulatory molecules, such as transcription factors and microRNAs, in modulating gene expression rates Finally, it addresses the involvement of the RNA Degradosome and proteasomes in the degradation of RNA and proteins, illustrating how bacteria can selectively target these molecules for expedited degradation.
Physical and chemical modeling offers distinct advantages and limitations; experiments reveal outcomes based on specific configurations of reality, while simulations explore the implications of countless assumed configurations However, not all configurations manifest in reality, and the sheer number of real-world scenarios makes exhaustive experimentation impractical Consequently, a model lacking experimental validation can lead to inaccuracies, whereas experiments devoid of a guiding model yield minimal value Therefore, the integration of both approaches is essential for comprehensive understanding and accurate results.
In this section, we explore the modeling of regulated gene expression in bacteria through a framework of chemical and biochemical reactions Our approach systematically converts key molecular interactions involved in transcription and translation, including protein-DNA, protein-RNA, protein-protein, and RNA-RNA interactions, into a comprehensive system of chemical reactions We empirically measure the kinetics and thermodynamics of these interactions in the laboratory and summarize the available data While a full kinetic mathematical model is typically utilized, we propose that assuming chemical equilibrium for protein-DNA interactions at the promoter can simplify the analysis Consequently, we demonstrate how to apply the chemical partition function as a more straightforward mathematical representation of the mechanistic interactions in Holoenzyme and transcription factor assembly.
In this article, we explore the design of synthetic gene networks that exhibit two distinct functionalities The first example features a protein device composed of fusion proteins that activates gene expression only when two specific transcription factors are present, effectively simulating a Boolean "AND" logic function This system offers several benefits, such as modularity, scalability, high fidelity, and a quick response to inputs By interconnecting multiple protein devices, we can engineer bacteria to react to a defined set of inputs with a predetermined genetic output The second example involves a three-gene system with repressor regulatory connections that generate long-lasting oscillations Our research investigates how the promoter region's structure influences the oscillation's period and amplitude.
3.2 An Overview of Regulated Bacterial Gene Expression
Gene expression is the process through which a cellular organism translates the genetic information in its genome into functional RNA and protein molecules These molecules are crucial for regulating biochemical reactions within the cell, influencing metabolism, growth, signal transduction, motility, and differentiation By modulating gene expression, cells can adapt to their environment, alter internal processes, replicate, and modify their surroundings Ultimately, the regulated expression of genes distinguishes living organisms from non-living chemical systems.
This article reviews the series of biochemical reactions that transcribe DNA into messenger RNA (mRNA) and translate mRNA into proteins, a process known as "the Central Dogma," coined by Francis Crick Focusing on bacterial gene expression, particularly in Escherichia coli, we explore the well-understood biochemical mechanisms that govern this process The discussion highlights various ways cellular organisms can modify gene expression rates in response to internal signals, revealing multiple strategies to regulate RNA and protein production Understanding these mechanisms is crucial for designing effective synthetic gene networks.
3.4.3 Results
Sensitivity Analysis of the Deterministic Model
The affinity between the PPI domains and their peptide ligands, K_d, significantly influences the false positive activation rate, as illustrated in Figure 3.5A. At low K_d values, most scaffold activators are bound to at least one CIP protein, and the false positive activation rate is highly sensitive to r_DBP1 because of the intense competition between the CIP and DBP proteins: even minor increases in r_DBP1 cause substantial rises in the false positive activation rate, up to a peak value. As K_d increases, this peak rate also rises, reaching its maximum for K_d between 1 and 3.5 µM. Further increases in K_d result in a decline in the maximum false positive activation rate and reduced sensitivity to r_DBP1. Increasing r_DBP1 at high K_d values produces a more linear rise in the false positive activation rate. To keep the false positive rate low at elevated r_DBP1 values, K_d should therefore be chosen either very low (around 0.01 µM) or in a higher range of 50 to 100 µM.
Decreasing ΔG_P, the binding free energy of the DNA-binding proteins, to a more negative value increases the rate of transcriptional initiation by complexes containing a single DNA-binding protein, such as C5 or C7. To avoid excessive false positives, the individual DNA-binding domains must therefore bind their operator sequences with affinities close to non-specific binding (ΔG_P = −6.5 to −7 kcal/mol). When both DBPs are present, the free energy contributed by their combined interactions is doubled, enabling complex C8 to bind specifically to the two operators and transactivate gene expression far more strongly than complexes C5 or C7. This approach uses the scaffold activator to form a dimerized transcription factor whose individual monomers do not bind specifically to their operators alone, employing modular protein-protein interaction domains rather than specialized surface interactions to achieve the desired behavior.
Competitively inhibiting proteins (CIPs) are essential for achieving an AND-like response in gene expression. In the absence of CIPs, a single DNA-binding protein is enough to let the scaffold activator bind DNA and activate gene expression, albeit with a lower affinity than when both DNA-binding proteins are present, and this false activation would cause the design to fail. To mitigate this risk while maintaining a high rate of gene expression when both inputs are present, CIPs must be produced constitutively at a rate of 1 to 4 proteins per second, corresponding to approximately 2500 to 10,000 proteins in the cell. This large pool of CIPs shifts the equilibrium of the scaffold activator towards CIP-bound states, so that a greater number of DNA-binding proteins is required to compete for the limited number of scaffold binding sites.
The steady-state response of the AND protein device gives the transcriptional initiation rate as a function of the DNA-binding protein production rates, r_DBP1 and r_DBP2, with contour lines indicating activation at 25%, 50%, 75%, and 99% of the maximum. The operating range is the non-shaded region, indicating the values of r_DBP1 and r_DBP2 for which false positive transcriptional activation remains acceptable, in particular when the production of one DNA-binding protein ceases. By using higher affinity peptide ligands for the scaffold activator binding sites and lower affinity ones for the DNA-binding proteins, similar effects can be achieved at reduced production rates. At baseline parameters, the low constitutive production of scaffold activators leads to competition for binding sites between the CIP and DBP proteins, which can increase false positive activation at higher production rates. To mitigate this, the scaffold activator production rate can be decreased to approximately 0.01-0.02 proteins/second so that nearly all scaffold complexes are occupied by CIP proteins unless both DBP proteins are present in adequate quantities.
A sensitivity analysis of the deterministic model reveals that the affinity between the PPI domains and the peptide ligands must be either low or high to maintain a practical operating range. Specifically, a small r_SA combined with a high r_CIP reduces the false positive activation rate while expanding the operating range of the regulatory inputs. Additionally, a weak binding interaction between the DNA-binding proteins and their operators is essential; otherwise, scaffold activators bound to only one type of DNA-binding protein, such as complexes C5 and C7, interact with the DNA too strongly, leading to high false positive transactivation rates.
By analyzing these overall trends, we select molecular components and their production rates to achieve a high fidelity and effective protein device. We begin by choosing a protein-protein interaction (PPI) domain and peptide ligand with a dissociation constant K_d of 50 µM, while maintaining the baseline value of ΔG_P. Such relatively weak interactions are frequently encountered in nature and facilitate the engineering of synthetic proteins.
To achieve a high-quality protein device, it is essential to adjust the production rates of scaffold activators and competitively inhibiting proteins, decreasing the former while increasing the latter by a specific factor.
Figure 3.7 illustrates the operating ranges and transcriptional initiation rates achieved by combining two or three of these modifications. A simultaneous decrease in r_SA and increase in r_CIP leads to a broader operating range while maintaining high transcriptional initiation rates. Conversely, increasing K_d while decreasing r_SA or increasing r_CIP expands the operating range but diminishes the accessible transcriptional initiation rate, particularly at lower r_DBP1 and r_DBP2 values. The combination of all three modifications yields the largest operating range while still allowing adequate transcriptional initiation at medium to large r_DBP1 and r_DBP2 values.
Alternatively, by using two super-binding PPI domains with a dissociation constant K_d of 0.01 µM and baseline values of r_SA and r_CIP, ΔG_P can be reduced to −7.5 kcal/mol while maintaining an adequate operating range. At these K_d values, the AND protein device achieves its highest fidelity, activating gene expression with minimal quantities of both DNA-binding proteins.
The presence of as few as 25 molecules of the DNA-binding proteins then acts as a switch for gene expression: expression is activated when the DNA-binding proteins are present and deactivated when they are absent. For applications requiring a more gradual activation of gene expression, a higher dissociation constant K_d is preferred, yielding a more graded response.
Steady-state Response of the Stochastic Model
We study the effects of stochasticity in the protein-protein interactions by relaxing the deterministic approximation, describing the kinetics with a Master equation, and solving for the stationary probability distribution of the system.
Figure 3.5: The false positive transcriptional initiation rate as a function of the affinity K_d between the protein-protein interaction domains and their peptide ligands, for DBP1 production rates r_DBP1 varied from 0.1 to 4.0 proteins/second (shown for ΔG_P = −12 kcal/mol). The relationship is biphasic: the false positive rate first rises sharply to a peak between K_d = 1 and 3.5 µM and then gradually declines toward baseline levels. Small decreases in ΔG_P produce large increases in the false positive activation rate.
The false positive transcriptional initiation rate, r_FP, also depends on the production rates of a single DNA-binding protein, r_DBP1, and of the competitively inhibiting proteins, r_CIP: a sufficiently large r_CIP is necessary to reduce false positive activation. In addition, as r_DBP1 increases, the false positive activation rate rises sharply with small increases in the scaffold activator production rate, r_SA.
Figure 3.7 illustrates the operating range of the DNA-binding protein production rates, r_DBP1 and r_DBP2, together with the maximum percentage of transcriptional initiation for four different sets of molecular components and production rates; the deviations from the baseline parameters are listed for each condition. For the stochastic model, we compute the steady-state response, namely the transcriptional initiation rates and their stationary probability distribution, and compare the behavior of the baseline and high K_d molecular components by analyzing their transcriptional activation rates at varying DNA-binding protein production rates. The main objective is to determine how fluctuations in the binding and unbinding of the PPI domains to their ligands alter the components required to meet the design goals.
The distribution of the false positive transcriptional initiation rate for both the baseline and the high K_d molecular components is approximately Gaussian and centered on the deterministic average rate. Although the average operating range agrees with the deterministic model, the AND protein device should only rarely exceed the maximum false positive transcriptional initiation rate. To keep the likelihood of false activation below 0.05, the maximum operating range of the regulatory inputs must be reduced, from 0.066 p/s to 0.025 p/s for the baseline components and from 0.79 p/s to 0.30 p/s for the high K_d components.
3.4.5 Conclusion and Outlook
Future applications such as living biosensors and gene therapies will necessitate intricate regulatory mechanisms to precisely control gene expression, ensuring the correct proteins are produced at the appropriate times Additionally, gene expression must be capable of being activated or deactivated in response to various environmental, regulatory, and metabolic signals The versatility and modular design of interacting fusion proteins, known as protein devices, make them valuable tools for synthetic biologists, enhancing their ability to manipulate biological systems effectively.
The programming of gene expression can be achieved through a set of responsive Boolean instructions, with a focus on characterizing modular protein domains, particularly protein-protein interactions and DNA-binding domains Given the vast array of potential combinations, quantitative modeling will serve as a valuable tool for constructing protein devices that yield specific Boolean behaviors Additionally, this modeling approach will aid in effectively linking multiple protein devices to engineer complex biological programs with multiple inputs and outputs in organisms.
Appendix Text: Notes on the Quantitative Model
The quantitative model employs a two-step approach to predict transcriptional initiation rates from the protein-protein and protein-DNA interactions. It assumes a well-stirred, isothermal system with a bacterial volume of 1.0x10^-15 liters.
The protein-protein interactions in the system are modeled using mass action kinetics with first and second order rate laws. This includes 12 reversible reactions, equivalent to 24 unidirectional reactions, describing the binding and unbinding between the scaffold activator and the various scaffold-binding proteins, namely the DNA-binding proteins and the competitively inhibiting proteins. In total there are 13 unique species, capturing all possible interactions among the scaffold-binding proteins (DBP1, DBP2, CIP1, CIP2).
(a) Activate if: DBP1 and (DBP2 or DBP3)
DBP1 and DBP2 and DBP3
Protein devices can effectively regulate transcriptional initiation by employing compound Boolean behaviors By integrating extra DNA-binding domains with existing peptide ligands, these devices exhibit AND-OR functionality, responding to either three or four regulatory inputs Additionally, incorporating a protein-protein interaction (PPI) domain along with a peptide ligand enables the device to operate under an AND-AND behavior, utilizing three regulatory inputs.
The protein device effectively represses transcriptional initiation through the removal of the transactivation domain and the strategic placement of operators around the promoter, demonstrating NOT AND behavior Additionally, by utilizing appropriately positioned operators and DNA-looping, the device harnesses long-range interactions to achieve AND Boolean behavior for activation or repression Each protein device (a-e) is designed with one competitively inhibiting protein corresponding to each PPI domain.
Table 3.8: The reaction network describing the protein-protein interactions between scaffold and scaffold-binding proteins.
SA + CIP1 ⇌ C2    SA + CIP2 ⇌ C4    SA + DBP1 ⇌ C5    SA + DBP2 ⇌ C7
C2 + CIP2 ⇌ C1    C2 + DBP2 ⇌ C3    C4 + CIP1 ⇌ C1    C4 + DBP1 ⇌ C6
C5 + CIP2 ⇌ C6    C5 + DBP2 ⇌ C8    C7 + CIP1 ⇌ C3    C7 + DBP1 ⇌ C8
Table 3.9: The reaction network describing production and degradation of scaffold and scaffold- binding proteins.
∅ → SA    ∅ → DBP1    ∅ → DBP2    ∅ → CIP1    ∅ → CIP2
The scaffold activator (SA) can bind zero, one, or two scaffold-binding proteins, giving a total of 9 possible scaffold species, including the free scaffold activator; together with the 4 free scaffold-binding proteins (DBP1, DBP2, CIP1, CIP2), this yields 13 chemical species. The reaction network is illustrated in Figure 3.3B.
Each protein-protein interaction (PPI) is characterized by a forward binding kinetic constant, k^+ (in [µM s]^-1), and a reverse binding kinetic constant, k^- (in s^-1). The affinity K_d (in µM) is defined as K_d = k^-/k^+ and depends on the specific protein interaction domain and peptide ligand involved. For simplicity, we assume that the forward rate constant is the same for all PPI domain-ligand interactions, with k^+ = 1x10^7 [M s]^-1, approximately the diffusion-limited association rate for two large proteins, and a baseline affinity of K_d = 1.0 µM. For a fixed K_d, the particular choice of k^+ and k^- does not affect the steady-state solution of the deterministic kinetic model; in the stochastic kinetic model, however, changing k^+ or k^- at fixed K_d can slightly alter the resulting steady-state distribution.
There are five inflow reactions, one for each of the chemical species SA, DBP1, DBP2, CIP1, and CIP2, with production rates r_SA, r_DBP1, r_DBP2, r_CIP1, and r_CIP2; we assume r_CIP1 = r_CIP2 = r_CIP. In addition, each chemical species in the network undergoes a first-order dilution reaction characterized by the dilution constant k_out. The transport of proteins into and out of the open system is therefore explicitly described.
In the deterministic version of the kinetic model, we convert the system of chemical reactions with transport into ordinary differential equations (ODEs). The resulting 13 ODEs are:

(scaffold-binding proteins)

d[DBP1]/dt = k^- ([C5] + [C6] + [C8]) − k^+ [DBP1] ([SA] + [C4] + [C7]) + r_DBP1 − k_out [DBP1]
d[DBP2]/dt = k^- ([C7] + [C3] + [C8]) − k^+ [DBP2] ([SA] + [C2] + [C5]) + r_DBP2 − k_out [DBP2]
d[CIP1]/dt = k^- ([C2] + [C1] + [C3]) − k^+ [CIP1] ([SA] + [C4] + [C7]) + r_CIP1 − k_out [CIP1]
d[CIP2]/dt = k^- ([C4] + [C1] + [C6]) − k^+ [CIP2] ([SA] + [C2] + [C5]) + r_CIP2 − k_out [CIP2]

(free scaffold activator)

d[SA]/dt = k^- ([C2] + [C4] + [C5] + [C7]) − k^+ [SA] ([DBP1] + [DBP2] + [CIP1] + [CIP2]) + r_SA − k_out [SA]

(single bound complexes)

d[C2]/dt = k^- ([C1] + [C3] − [C2]) − k^+ [C2] ([DBP2] + [CIP2]) + k^+ [SA][CIP1] − k_out [C2]
d[C4]/dt = k^- ([C1] + [C6] − [C4]) − k^+ [C4] ([DBP1] + [CIP1]) + k^+ [SA][CIP2] − k_out [C4]
d[C5]/dt = k^- ([C8] + [C6] − [C5]) − k^+ [C5] ([DBP2] + [CIP2]) + k^+ [SA][DBP1] − k_out [C5]
d[C7]/dt = k^- ([C8] + [C3] − [C7]) − k^+ [C7] ([DBP1] + [CIP1]) + k^+ [SA][DBP2] − k_out [C7]        (3.13)

(double bound complexes)

d[C1]/dt = k^+ [C2][CIP2] + k^+ [C4][CIP1] − 2 k^- [C1] − k_out [C1]
d[C3]/dt = k^+ [C2][DBP2] + k^+ [C7][CIP1] − 2 k^- [C3] − k_out [C3]
d[C6]/dt = k^+ [C4][DBP1] + k^+ [C5][CIP2] − 2 k^- [C6] − k_out [C6]
d[C8]/dt = k^+ [C7][DBP1] + k^+ [C5][DBP2] − 2 k^- [C8] − k_out [C8]        (3.14)
Using Matlab's fsolve function, we determine the steady-state solution of the system by setting the rate of change of each species to zero. The calculation consistently converges with a residual close to machine epsilon and yields a single solution. The steady-state solution is a 13 x 1 state vector C containing the concentration of each chemical species, and is a function of the model parameters K_d, r_SA, r_DBP1, r_DBP2, and r_CIP.
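For concreteness, the same steady-state calculation can be sketched with SciPy in place of Matlab's fsolve. The code below is an illustrative re-implementation under the assumptions stated above (a single forward constant k_f, a single reverse constant k_b = k_f * K_d, and a common dilution constant k_out); the numerical parameter values are placeholders rather than the values used in this work.

```python
import numpy as np
from scipy.optimize import fsolve

# Species: the free scaffold activator and scaffold-binding proteins, plus complexes C1-C8
names = ["SA", "DBP1", "DBP2", "CIP1", "CIP2",
         "C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8"]
idx = {n: i for i, n in enumerate(names)}

# Reversible binding reactions from Table 3.8: reactant1 + reactant2 <-> complex
bindings = [("SA", "CIP1", "C2"), ("SA", "CIP2", "C4"), ("SA", "DBP1", "C5"), ("SA", "DBP2", "C7"),
            ("C2", "CIP2", "C1"), ("C2", "DBP2", "C3"), ("C4", "CIP1", "C1"), ("C4", "DBP1", "C6"),
            ("C5", "CIP2", "C6"), ("C5", "DBP2", "C8"), ("C7", "CIP1", "C3"), ("C7", "DBP1", "C8")]

# Placeholder parameters (concentrations in uM, time in s); K_d = k_b / k_f
k_f, K_d, k_out = 10.0, 1.0, 5.8e-4
k_b = k_f * K_d
production = {"SA": 1e-3, "DBP1": 1e-3, "DBP2": 1e-3, "CIP1": 1e-2, "CIP2": 1e-2}  # uM/s

def rhs(c):
    """Right-hand side of the 13 ODEs: mass-action binding, production, and dilution."""
    dc = np.zeros_like(c)
    for a, b, ab in bindings:
        flux = k_f * c[idx[a]] * c[idx[b]] - k_b * c[idx[ab]]
        dc[idx[a]] -= flux
        dc[idx[b]] -= flux
        dc[idx[ab]] += flux
    for n, r in production.items():
        dc[idx[n]] += r
    dc -= k_out * c
    return dc

c_ss = fsolve(rhs, np.full(13, 0.1))
print(dict(zip(names, np.round(c_ss, 4))))
```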
In the stochastic version of the kinetic model, we describe the system as a jump Markov process whose time evolution is governed by the Master equation. To simulate the stochastic dynamics of the coupled chemical and biochemical reactions, we use Hy3S, which features an adaptive hybrid jump/continuous algorithm. Using the provided graphical user interface (GUI), we enter the reactions and kinetic parameters of the network, set the volume to 1.0x10^-15 liters, and request 10,000 independent trials with suitable initial conditions.
The simulation algorithm generates a NetCDF data file after reading input data and conducting 10,000 independent trials of the stochastic dynamics of the reaction network The resulting solution data is imported into Matlab, where the probability distribution of each chemical species is calculated To ensure accuracy, we verify that the joint probability distribution at the final time reflects the steady-state distribution by assessing the rate of change, confirming it approaches zero Ultimately, this process yields a steady-state ensemble representing the number of molecules for each chemical species.
SME (a 10000 x 13 matrix), which is a function of the model parameters.
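Given the 10000 x 13 steady-state ensemble, the marginal stationary distribution of any species can be estimated with a simple histogram. The snippet below assumes the ensemble has already been loaded into a NumPy array (here filled with synthetic Poisson numbers purely for illustration).

```python
import numpy as np

# Placeholder for the 10000 x 13 steady-state ensemble (rows = trials, columns = species)
ensemble = np.random.default_rng(1).poisson(lam=50, size=(10000, 13))

def marginal_distribution(ensemble, species_index):
    """Estimate P(X = n) for one species from the steady-state ensemble."""
    counts = np.bincount(ensemble[:, species_index])
    return counts / counts.sum()

p = marginal_distribution(ensemble, species_index=12)     # e.g., the C8 complex
mean_molecules = np.dot(np.arange(p.size), p)
```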
The second part of the quantitative model is a chemical partition function that describes the protein-DNA interactions and the rate of transcriptional initiation. The partition function enumerates all possible configurations of the system and uses their energies to determine the equilibrium probability of each configuration. This treatment is appropriate here for several reasons: the number of configurations of bound and unbound DNA sites is small; approximating the concentration of a single DNA site as a continuous variable is a poor approximation; the binding and unbinding of DNA-binding proteins is highly stochastic; and directly simulating these events would be computationally expensive. We therefore assume that the protein-DNA interactions operate at chemical equilibrium. The configurations and their free energies are listed in Table 3.10, and the values of the free energies are described in the main text. The probability of finding the system in any particular configuration, as a function of the concentrations of C5, C7, C8, and RNA polymerase, is given in Eq. (3.9), and the rate of transcriptional initiation is given in Eq. (3.10).
Table 3.10: A list of the configurations, their total binding energies with respect to the reference state, and the density of states per configuration in the chemical partition function.
State | O1 | O2 | P | ΔG_tot | h (density of states)
7 | C5 | – | RNAP | ΔG_P + ΔG_R | [C5][RNAP]
9 | C8 | C8 | RNAP | 2ΔG_P + ΔG_R | [C8][RNAP]

The rate of transcriptional initiation is the probability of finding the promoter in a transcriptionally active state multiplied by the transcription initiation rate, estimated at k_I = 0.1 sec^-1.
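The bookkeeping behind the partition function is straightforward: each configuration contributes its density of states h multiplied by a Boltzmann factor exp(-ΔG_tot/RT), the probabilities are obtained by normalization, and the transcriptional initiation rate is k_I times the total probability of the RNAP-bound configurations. The sketch below illustrates this for the reference (empty) state plus only the two configurations listed in Table 3.10; the full model contains additional configurations, and the free-energy and concentration values here are placeholders.

```python
import numpy as np

R, T = 1.987e-3, 310.0            # gas constant in kcal/(mol K), temperature in K
RT = R * T

dG_P, dG_R = -9.0, -6.0           # placeholder binding free energies (kcal/mol)
conc = {"C5": 0.05, "C8": 0.02, "RNAP": 1.0}   # placeholder concentrations (uM)

# (total binding free energy, density of states h, transcriptionally active?)
configs = [
    (0.0,             1.0,                        False),  # reference (unbound) state
    (dG_P + dG_R,     conc["C5"] * conc["RNAP"],  True),   # state 7: C5 and RNAP bound
    (2 * dG_P + dG_R, conc["C8"] * conc["RNAP"],  True),   # state 9: C8 on both operators, RNAP bound
]

weights = np.array([h * np.exp(-dG / RT) for dG, h, _ in configs])
probs = weights / weights.sum()

k_I = 0.1   # sec^-1, initiation rate from a transcriptionally active configuration
r_init = k_I * probs[[i for i, (_, _, active) in enumerate(configs) if active]].sum()
```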
3.5.1 Introduction
Previous research has examined naturally oscillating systems through deterministic models and non-linear dynamics techniques, such as bifurcation analysis A notable example is the study of the drosophila circadian rhythm, which serves as a simplified model illustrating the interplay of positive and negative regulatory mechanisms in circadian rhythms.
Recent studies have mathematically modeled the entrainment of synthetic oscillators to bacterial cell cycles However, traditional deterministic methods often rely on approximations regarding the continuity and differentiability of biological reactions, which can be inaccurate for various processes, particularly gene expression In this article, we present a fully stochastic representation of the system's dynamics and employ stochastic simulations to generate an ensemble of trajectories.
Inspired by the work of Elowitz and Leibler, we construct a cyclic network of three genes connected by negative feedback loops, in which each protein product represses the expression of the next gene, allowing for the possibility of sustained oscillations. Our quantitative model of the repressilator incorporates components from the well-studied lac, tet, and ara operons, using kinetic parameters taken from the literature. In addition, a diverse set of lac and tet DNA sites and repressor proteins has been developed through extensive mutagenesis, each with its own kinetic properties.
Our approach differs significantly from previously developed models by employing a detailed, mechanistic framework for bacterial transcription and translation, encompassing all protein-protein interactions without oversimplification We model transcriptional and translational elongation as gamma-distributed events, with rates of 30 nucleotides per second and 33 amino acids per second, respectively Utilizing a hybrid stochastic-discrete and stochastic-continuous algorithm, our models provide a more accurate representation of single-cell behavior compared to deterministic kinetics, which assume continuous concentrations This hybrid algorithm effectively simulates the dynamics of both discrete and continuous reactions, addressing the limitations of traditional Langevin approaches While developing a deterministic-continuous model may be simpler for capturing existing concentration profiles, our detailed model aims to establish design rules for novel gene regulatory networks for experimental validation However, this complexity necessitates extensive numerical simulations for mathematical analysis, and the theoretical framework for stability and bifurcation in stochastic systems remains less developed To mitigate this challenge, we incorporate techniques from electronic circuit design, such as the cyclic covariance function, to analyze the periods of stochastic limit cycles.
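As noted above, transcriptional and translational elongation are treated as gamma-distributed events: an elongation consisting of n identical steps, each occurring at rate k per second, has a total completion time distributed as Gamma(n, 1/k). A short sampling sketch using the step rates quoted above (the function name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def elongation_time(n_steps, step_rate, size=None):
    """Sample gamma-distributed elongation times: n_steps steps at step_rate per second."""
    return rng.gamma(shape=n_steps, scale=1.0 / step_rate, size=size)

transcription_delays = elongation_time(660, 30.0, size=5)   # 660 nt at 30 nt/sec (~22 s mean)
translation_delays = elongation_time(220, 33.0, size=5)     # 220 aa at 33 aa/sec (~6.7 s mean)
```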
This paper presents various network connectivity configurations and kinetic parameters to develop a robust repressilator Initial designs using existing molecular components fail to produce sustained oscillations, prompting the identification of mutations that enhance oscillatory behavior A sensitivity analysis is conducted on design parameters, revealing how factors such as operator count, operator-repressor affinities, mRNA and protein half-lives, and the availability of ribosomes and RNA polymerase influence oscillation periods This information is valuable for synthetic biologists aiming to construct oscillating gene networks.
3.5.2 Models and Methods
Consider a system of N species, S_1, ..., S_N, reacting through M reaction pathways in a volume V. We define the state vector X(t) as the N-dimensional vector containing the number of molecules of each species at time t, and the stoichiometric matrix ν captures the change in the number of molecules of each species caused by each reaction. The probability that the j-th reaction occurs somewhere in the system in the next infinitesimal time interval dt is a_j(X(t)) dt, where the propensity can be written as a_j = c_j h_j, with h_j the number of distinct combinations of reacting molecules and c_j the mesoscopic reaction rate constant.
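These quantities are all that is needed for an exact stochastic simulation. A minimal sketch of Gillespie's direct-method SSA (discussed in the next paragraph), written generically in terms of ν and a_j(X), is shown below; it is an illustration, not one of the optimized variants referred to in the text.

```python
import numpy as np

def ssa(x0, stoich, propensities, t_end, rng=np.random.default_rng(3)):
    """Gillespie direct method. stoich: (M, N) state-change matrix nu;
    propensities(x) returns the length-M array of propensities a_j(x)."""
    t, x = 0.0, np.array(x0, dtype=float)
    times, states = [t], [x.copy()]
    while t < t_end:
        a = propensities(x)
        a0 = a.sum()
        if a0 <= 0.0:
            break
        t += rng.exponential(1.0 / a0)        # time to the next reaction
        j = rng.choice(len(a), p=a / a0)      # index of the reaction that fires
        x += stoich[j]
        times.append(t)
        states.append(x.copy())
    return np.array(times), np.array(states)

# Example: constitutive mRNA production and first-order degradation
k1, k2 = 0.5, 0.01
stoich = np.array([[+1.0], [-1.0]])
t, x = ssa([0], stoich, lambda x: np.array([k1, k2 * x[0]]), t_end=1000.0)
```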
The original stochastic simulation algorithm (SSA) by Gillespie accurately simulates the trajectories of a jump Markov process defined by the Master equation Although improved variants have reduced computational costs while maintaining the jump Markov framework, the expense of simulation increases with the frequency of reaction occurrences This poses challenges for systems, such as the repressilator, where many biomolecular interactions, particularly protein dimerization, are classified as 'fast' reactions, leading to significant computational demands By treating these fast reactions as continuous, we can reformulate their mathematical representation into a continuous Markov process, allowing us to describe their dynamics using a set of chemical Langevin equations.
[73] The result is a system of Itô stochastic differential equations (SDEs) with multiple multiplicative noises, or

dX_i = \sum_{j=1}^{M_{fast}} \nu_{ij} a_j(X(t)) \, dt + \sum_{j=1}^{M_{fast}} \nu_{ij} \sqrt{a_j(X(t))} \, dW_j,

where a is the vector of the fast reaction propensities, W is a multi-dimensional Wiener process, and ν is restricted to the fast reactions only.
Integrating discrete-stochastic models, such as the Stochastic Simulation Algorithm (SSA), with continuous stochastic models like the Chemical Langevin Equation (CLE) presents a significant challenge To address this, we have developed a hybrid stochastic algorithm that effectively combines both models, demonstrating superior performance over the SSA while maintaining its accuracy This innovative algorithm segments the system into subsets of fast/continuous and slow/discrete reactions, utilizing the CLE to capture the effects of fast reactions It then employs a system of differential Jump equations to solve for slow reaction times, where the reaction residuals are determined by a uniform random number The differential Jump equations are solved by randomly selecting negative initial conditions, integrating them over time, and tracking the zero crossings of the reaction residuals.
The j-th slow reaction occurs at the time t_j when R_j(t_j) = 0. The system of differential jump equations is itself a set of SDEs, because it is coupled to the chemical Langevin equation through the state vector X(t).
Using stochastic numerical integrators like the Euler-Maruyama or Milstein methods allows for solving coupled systems of stochastic differential equations (SDEs) while accurately determining the global error of chemical Langevin equations and slow reaction times This approach has demonstrated significant accuracy and can be exponentially faster than traditional stochastic simulation algorithms, particularly in the presence of fast reactions We apply this hybrid stochastic simulation method to analyze the stochastic dynamics of the repressilator gene network, computing at least 100 independent trajectories for each set of kinetic parameters.
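A stripped-down view of one step of the hybrid scheme just described: the fast subsystem is advanced by an Euler-Maruyama step of the chemical Langevin equation, while each slow reaction carries a residual R_j, initialized to the logarithm of a uniform random number (a negative value) and integrated upward at rate a_j(X); the j-th slow reaction fires when R_j crosses zero. This is only a schematic under those assumptions, not the adaptive Hy3S implementation.

```python
import numpy as np

rng = np.random.default_rng(4)

def hybrid_step(x, r_slow, dt, nu_fast, a_fast, nu_slow, a_slow):
    """One Euler-Maruyama step of the CLE for the fast reactions, plus integration
    of the slow-reaction residuals R_j; slow reactions fire at zero crossings.
    r_slow should be initialized by the caller as np.log(rng.uniform(size=M_slow))."""
    af = a_fast(x)
    dW = rng.normal(0.0, np.sqrt(dt), size=af.size)
    # CLE update: drift nu^T a dt plus diffusion nu^T sqrt(a) dW
    x = x + nu_fast.T @ (af * dt) + nu_fast.T @ (np.sqrt(np.maximum(af, 0.0)) * dW)
    # Advance the slow-reaction residuals, dR_j = a_j(x) dt
    r_slow = r_slow + a_slow(x) * dt
    for j in np.flatnonzero(r_slow >= 0.0):
        x = x + nu_slow[j]                    # the j-th slow reaction fires
        r_slow[j] = np.log(rng.uniform())     # reset to a new negative residual
    return x, r_slow
```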
Gene expression is inherently stochastic, leading to significant variability in single-cell behaviors despite identical initial conditions Internal noise causes fluctuations in protein and mRNA oscillations, affecting their periods, amplitudes, and phases To harness these oscillations effectively, gene networks must be engineered to minimize such fluctuations We employ a technique from electronic circuit design to quantitatively assess the stochasticity of these oscillations, treating the oscillating protein signals as cyclostationary signals By applying the Fourier transform to their autocorrelation functions, we can calculate the average and standard deviation of oscillation periods, ensuring accuracy even when signals are partially obscured by background noise.
The oscillatory number of molecules of a species S_i is treated as a cyclostationary signal, represented by the discrete-index random process X_i(t). The mean of the time series is µ_i(t) = E[X_i(t)] and the covariance is C_ij(t, t+τ) = E[(X_i(t) − µ_i(t))(X_j(t+τ) − µ_j(t+τ))]. A signal X_i(t) is cyclostationary if there exists an integer g such that µ_i(t) = µ_i(t + g) and C_ij(t; τ) = C_ij(t + g; τ) for all integers i and j. To determine the period of oscillation, we compute the cyclic correlation function,

C_xx(α, τ) = (1/T) \sum_{t=1}^{T} [X_i(t) − µ_i(t)] [X_i(t+τ) − µ_i(t+τ)] \exp(−j α t),        (3.17)

where T is the number of data points, τ is set to zero, j is the square root of −1, and the cycle parameter is α = 2πn/P, with P the period of oscillation and n an integer.
A plot of C_xx versus the cycle frequency α exhibits peaks at the dominant oscillation periods of the signal X_i(t). A peak at α = 0 is always present, corresponding to an infinite period. For each simulation and each oscillating protein species, the primary non-zero peak (n = 1) is used to determine the period; additional peaks appear at harmonic multiples for higher n. The reported period is the average over all trials, and its standard deviation is computed from the α values as σ_P = 2π σ_α / ⟨α⟩^2, where ⟨·⟩ denotes the average over trials. The cyclic correlation curves shown are averaged over all trials and normalized so that the highest amplitude is one. Systems with stable oscillations and small period fluctuations produce a sharp, distinct α peak in the averaged cyclic correlation functions.
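Equation (3.17) with τ = 0 can be evaluated on a grid of cycle frequencies α to locate the primary non-zero peak and hence the period P = 2π/α. The sketch below assumes uniformly sampled trajectories and uses the sample mean in place of the cyclostationary mean; it is meant only to illustrate the bookkeeping, and the pulse-train test signal is synthetic.

```python
import numpy as np

def cyclic_correlation(x, alphas, dt=1.0):
    """C_xx(alpha, tau = 0) = (1/T) * sum_t [x(t) - <x>]^2 * exp(-i * alpha * t)."""
    t = np.arange(x.size) * dt
    xc = x - x.mean()
    return np.array([np.sum(xc * xc * np.exp(-1j * a * t)) / x.size for a in alphas])

def dominant_period(x, dt, alpha_min, alpha_max, n_alpha=2000):
    """Period from the primary non-zero peak (n = 1); alpha_min must exclude the
    ever-present peak at alpha = 0, which corresponds to an infinite period."""
    alphas = np.linspace(alpha_min, alpha_max, n_alpha)
    c = np.abs(cyclic_correlation(x, alphas, dt))
    return 2.0 * np.pi / alphas[np.argmax(c)]

# Illustration: a noisy pulse train with a 16.2 hour period (time in hours)
rng = np.random.default_rng(5)
t = np.arange(0.0, 140.0, 0.05)
x = np.where(np.sin(2.0 * np.pi * t / 16.2) > 0.5, 1.0, 0.0) + 0.2 * rng.normal(size=t.size)
period = dominant_period(x, dt=0.05, alpha_min=0.1, alpha_max=2.0)   # close to 16.2
```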
TTTACA TAGCATTTTTATCCATAA TATGTT AGCGGATCCTAAGC   (araI2)
TTGACA TTGTGAGCGGATAACAA GATACT TTGTGAGCGGATAACAAG   (lacO1)
TTGACA TCCCTATCAGTGATAGA GATACT TCCCTATCAGTGATAGAGA   (tetO2)

Figure 3.13: The network connectivity for the lac-tet-ara oscillating gene network. Below, the sequences of the promoter regions, using a single, promoter-overlapping operator per gene.
Table 3.11: The base reactions and kinetic rates for the lac-tet-ara system. Rates of first order reactions are given in s^-1 and rates of second order reactions in (M s)^-1. Reactions listed with two kinetic constants are γ-distributed events: the first value is the rate of each individual step and the second is the total number of steps.
1  2 LacI → LacI2   1.0e9 [244]
2  LacI2 → 2 LacI   10 [244]
3  2 LacI2 → LacI4   1.0e9 [244]
4  LacI4 → 2 LacI2   10 [244]
5  LacI4 + lacO1 → LacI4:lacO1   5.0e9 [247]
6  LacI4:lacO1 → LacI4 + lacO1   3.85e-4 [247]
7  2 tetR → tetR2   1.0e9 §
8  tetR2 → 2 tetR   10 §
9  tetR2 + tetO2 → tetR2:tetO2   2.98e6 [249]
10  tetR2:tetO2 → tetR2 + tetO2   2.13e-2 [249]
11  2 araC → araC2   1.0e9 §
12  araC2 → 2 araC   10 §
13  araC2 + araI1/I2 → araC2:araI1/I2   1.0e7 [252] ¶
14  araC2:araI1/I2 → araC2 + araI1/I2   4.0e-3 [252]
15  RNAp + lacP:lacO1 → RNAp:lacP:lacO1   2.0e6 [247]
16  RNAp + tetP:tetO2 → RNAp:tetP:tetO2   8.6e5 [250]
17  RNAp + araP:araI1/I2 → RNAp:araP:araI1/I2   2.0e8 [252] ¶
18  RNAp:lacP:lacO1 → RNAp:lacP*   0.01 [247]
19  RNAp:tetP:tetO2 → RNAp:tetP*   0.13 [250]
20  RNAp:araP:araI1/I2 → RNAp:araP*   0.167 [252]
21  RNAp:lacP* → lacP:lacO1 + RNAp:DNAlac   30 nt/sec [247]
22  RNAp:tetP* → tetP:tetO2 + RNAp:DNAtet   30 nt/sec §
23  RNAp:araP* → araP:araI1/I2 + RNAp:DNAara   30 nt/sec §
24  RNAp:tetP:tetO2 → RNAp + tetP:tetO2   9.10 [250]
25  RNAp:araP:araI1/I2 → RNAp + araP:araI1/I2   0.06 [252]
26  RNAp:lacP:lacO1 → RNAp + lacP:lacO1   0.01 [247]
27  RNAp:DNAlac → RNAp + tet_mRNA   30 nt/sec, 660 nt [207]
28  RNAp:DNAtet → RNAp + ara_mRNA   30 nt/sec, 660 nt [207]
29  RNAp:DNAara → RNAp + lac_mRNA   30 nt/sec, 660 nt [207]
30  lac_mRNA + rib → rib:lac_mRNA   1.0e5 †
31  tet_mRNA + rib → rib:tet_mRNA   1.0e5 †
32  ara_mRNA + rib → rib:ara_mRNA   1.0e5 †
33  rib:lac_mRNA → rib:lac_mRNA* + lac_mRNA   33 aa/sec [248]
34  rib:tet_mRNA → rib:tet_mRNA* + tet_mRNA   33 aa/sec [248]
35  rib:ara_mRNA → rib:ara_mRNA* + ara_mRNA   33 aa/sec [248]
36  rib:lac_mRNA* → rib + LacI + Dlac   33 aa/sec, 220 aa [248]
37  rib:tet_mRNA* → rib + tetR + Dtet   33 aa/sec, 220 aa [248]
38  rib:ara_mRNA* → rib + araC + Dara   33 aa/sec, 220 aa [248]
39  LacI → ∅   2.31e-3 [247]
40  tetR → ∅   2.31e-3 ‡
41  araC → ∅
42  Dlac → ∅
43  Dtet → ∅
44  Dara → ∅
45  LacI2 → ∅
46  LacI4 → ∅   2.31e-3 [247]
47  tetR2 → ∅   2.31e-3 ‡
48  araC2 → ∅   1.93e-4 §
49  lac_mRNA → ∅   2.0e-3 £
50  tet_mRNA → ∅   2.0e-3 £
51  ara_mRNA → ∅   2.0e-3 £
Notes on kinetic constant sources:
† values were adjusted to give approximately 20 proteins per mRNA.
‡ based on typical protein degradation half-lives.
§ values were estimated for the tet and ara parameters based on literature values for the lac system.
¶ the forward and backward reaction rates were estimated from a given K_d value.
3.5.3 The lac-tet-ara Gene Network
The lac-tet-ara system is a gene network that can be experimentally realized using various combinations of molecular components, and it is designed to permit, though not guarantee, sustained oscillations. In this system, the production of LacI monomers is inhibited by AraC proteins bound to promoter-overlapping I1/I2 sites, the production of TetR monomers is repressed by LacI tetramers bound to promoter-overlapping lac operators, and the production of AraC monomers is suppressed by TetR2 dimers bound to promoter-overlapping tet operators.
The lac, tet, and ara operators can be moved, replicated, or replaced with mutant variants. Alterations to the 5' UTR region of the repressor mRNAs can increase or decrease their degradation rates, and fusing the repressor proteins to ssrA peptides can further increase their degradation. A specific configuration featuring one wild type operator regulating each gene, along with wild type repressor proteins and mRNAs, is illustrated in Figure 3.13, and its corresponding reaction mechanisms are listed in Table 3.11. Although numerous configurations exist, this discussion concentrates on those that can be synthesized using currently available molecular components, including the inducible promoters developed by Bujard and a simpler single-operator design.
When substituting wild type DNA, proteins, or mRNA with mutant variants, we change the reaction kinetics of the system Conversely, replicating and positioning additional copies of an operator next to an existing one introduces new reactions, incorporating interactions between the new operators, repressors, and RNA Polymerases.
Table 3.12 presents a concise overview of experimentally characterized mutant DNA sites and repressor proteins from the lac and tet operons. Utilizing a comprehensive mechanistic model allows the impact of introducing new genetic elements to be simulated precisely, avoiding potentially misleading approximations.
To link simulation results with observable phenotypes, a fluorescent protein coding sequence—like GFP, YFP, or CFP—is added bicistronically after each repressor coding sequence This setup allows for the quantitative measurement of fluorescent protein concentrations at the single-cell level using optical microscopy, with their production regulated by the corresponding repressor In this model, the half-lives of the fluorescent proteins, which are fused with ssrA peptides, are maintained at a constant 30 minutes For future applications, these fluorescent proteins can be substituted with functional proteins, such as enzymes The fluorescent proteins will be designated as Dlac, Dtet, and Dara, corresponding to their coexpressed repressor proteins.
3.5.4 Results and Discussion
Designs Using Wild Type Kinetics Do Not Oscillate
The design of an oscillating gene network begins with wild type molecular components, featuring a configuration where a single operator regulates each gene using the wild type kinetics for the tetO2, lacO1, and aral1/12 DNA sites, along with the corresponding repressor mRNAs and proteins Analysis of the dynamical behavior and cyclic covariance functions of fluorescent proteins Dlac, Dtet, and Dara reveals an absence of sustained oscillations, with Dlac and Dara expressed constitutively while Dtet remains fully repressed The lacI tetramers exert excessive repression, whereas the araC dimers provide insufficient repression The cyclic covariance functions display a dominant peak at the center, indicating an infinite period, with Dlac and Dara showing a rapid decrease in amplitude following an initial peak at 0.225 hours, and harmonic peaks at integer intervals up to the simulation's conclusion at 27.7 hours In contrast, the Dtet species lacks a dominant period, exhibiting an infinite one and resulting in a bow-shaped arc.
If the system produced sustained oscillations, the cyclic covariance function would instead exhibit a sharp peak at a non-zero cycle frequency corresponding to the period of oscillation.
The dynamic behavior of the Dlac, Dtet, and Dara proteins was observed over a period of 27.7 hours, showcasing significant fluctuations Additionally, the cyclic covariance function for the same system indicated a peak amplitude that, while lower than the central peak, surpassed the peaks noted at the simulation's conclusion.
In our example configuration utilizing wild type molecular components, we modify the regulatory regions by incorporating inducible promoter regions developed by Lutz, Lozinski, Ellinger, and Bujard The production of TetR, AraC, and LacI monomers is regulated by specific promoters, with the promoter regions featuring two tetO2 operators, two lacO1 operators, and a single araI1/I2 site The positioning of the araI1/I2 sites in the P promoter region causes AraC to function as a repressor, resulting in constitutive production of Dlac and Dara proteins, while Dtet expression remains fully repressed This configuration leads to no sustained oscillations due to an imbalance of repression among the three genes Our choice of these designs highlights the efficacy of computational modeling in swiftly identifying non-viable designs for oscillating gene networks, demonstrating that these initial configurations do not achieve the intended functionality.
An Oscillating Gene Network with a 3-2-1 Mutant Operator Configuration
We have created an asymmetric 3-2-1 operator design by modifying the Bujard promoter regions in two ways The first is the creation of a promoter region containing 3 lac operators, combining
The dynamical behavior of the Dlac, Dtet, and Dara proteins was analyzed using a 3-2-1 operator configuration with mutant tet and lac operators, along with a TetR protein half-life of five minutes The modified lac operators exhibit a decreased affinity to the Lacl tetramer, while the tet operators show increased affinity to the TetR2 repressor Over a 5.8-day interval, sustained oscillations were observed primarily in the Dara species Enhancing the affinity of the aral12 site and reverting the TetR protein half-life to its wild type improved oscillation quality, resulting in an average oscillation period of 16.2 hours, closely resembling natural circadian rhythms Further adjustments, including reducing the TetR protein half-life to 10 minutes, led to a decreased average oscillation period of 15.3 hours and reduced variability, with equal amplitudes among the fluorescent proteins.
A Systematic Analysis of the Oscillation Envelope
In silico modeling enhances research by enabling comprehensive exploration of parameter space, crucial for determining the oscillation envelope's width This method allows for a detailed analysis of initial concentrations of RNA polymerases and ribosomes, as well as the degradation rates of proteins and mRNAs, leading to more accurate and useful results.
The dynamical behavior of the Dlac (red), Dtet (blue), and Dara (green) proteins using the 3-2-1 operator design with mutant tet, lac, and ara operators. (A) TetR protein half-life of 30 minutes. (B) The same system as in (A), but with a TetR protein half-life of 10 minutes.
In a study of protein oscillation dynamics, the normalized average cyclic correlation functions of Dlac (red), Dtet (blue), and Dara (green) proteins were analyzed using a 3-2-1 operator configuration with mutant tet, lac, and ara operators When the TetR protein had a half-life of 30 minutes, the oscillation period was observed to be 16.2 ± 4.1 hours In contrast, with a reduced TetR protein half-life of 10 minutes, the oscillation period decreased to 15.3 ± 2.7 hours, with vertical gray lines indicating the standard 68% confidence interval.
The study examines how the number of operators and the affinity between repressors and operators influence the oscillation period To focus on specific variables, a symmetric "survey model" was developed, balancing the asymmetric wild-type Jac, tet, and ara operons, as outlined in Table 3.11 This model features one to three operator sites that regulate repressor production, with symmetric rate constants aligned with the available molecular components, encompassing both wild-type and mutant variants.
The study examines the repressor's affinity for the operator sites by adjusting the equilibrium binding constants through changes in the half-lives of the repressor-operator complexes. Because many DNA-protein interactions have forward rate constants close to the diffusion limit of approximately 10^8 (M s)^-1, this forward rate constant is held fixed. By varying the dissociation rate of the complex, affinities ranging from 10^9 to 10^13 M^-1 are obtained, corresponding to complex half-lives from 7 seconds to 19 hours. The results of the sensitivity analysis provide essential design rules for constructing a three-gene repressilator.
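As a worked check of this conversion (using an illustrative affinity rather than one of the surveyed values), the affinity and the half-life of the repressor-operator complex are related by

K_a = k^+ / k^-,   so   t_{1/2} = \ln 2 / k^- = \ln 2 \, K_a / k^+ .

For example, with k^+ = 10^8 (M s)^-1 and K_a = 10^{10} M^-1, the dissociation rate is k^- = 10^-2 s^-1 and t_{1/2} ≈ 69 seconds.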
Total repression must be balanced, avoiding extremes of strength In a symmetric survey model with three operator sites per regulated gene, significant oscillations occur across a repressor-operator affinity range of 10? to 10!! M~! The oscillation period is influenced by this affinity; for example, at 102 M~!, it lasts 3.39 hours, while at 10! M~!, it extends to 11.55 hours Removing one operator site from each gene results in a 2-operator model that exhibits broader oscillations, with periods ranging from 20.04 hours at an affinity of 1012 M7! to 2.94 hours at 10? M~! In contrast, a single operator per gene yields a narrow envelope of oscillation near an affinity of 10'!! M7~!, resulting in irregular patterns Overall, increased repressor-operator affinity corresponds to longer oscillation periods.
To examine the effects of asymmetry in gene regulation, a series of models was created, maintaining a fixed repressor-operator affinity of 10!° M7! across all operator sites of two genes while varying the affinities of a third gene's sites The findings reveal that the asymmetric model exhibits a broader range of oscillation periods compared to the symmetric model, with oscillation periods of 4.93 hours at 10? M~! (compared to 3.39 hours in the symmetric case) and 11.25 hours at 10'* M~! (where no oscillations occur in the symmetric scenario) When two active operator sites per gene are present, the model oscillates with periods ranging from 3.85 to 7.78 hours across the same affinity spectrum However, with only one operator site per gene, regular oscillations are absent These results are visually summarized in Figure 3.18.
The analysis reveals that the system can oscillate with repressor-operator affinities that vary by up to two orders of magnitude across the different gene regulatory regions, assuming equal numbers of operator sites for each gene; excessive asymmetry in repression levels, however, dampens the oscillations. While an increase in affinity generally leads to a longer oscillation period, this trend is less pronounced when only one gene's operator sites are altered. The oscillator's period corresponds to the sum of the pulse widths of the three fluorescent protein concentrations, each defined by the time required for a gene to produce its repressor, to be repressed by another repressor, and for its mRNA and protein levels to return to zero. Variations in operator numbers and repressor-operator affinities therefore produce unequal pulse widths. The symmetric model shown in Figure 3.19 has equal repressor-operator affinities across all operons, in contrast to the asymmetric model, in which the affinities of one operon are increased to 10^12 M^-1; only one marker protein is shown, as the other two were unaffected by the change.
Figure 3.18 illustrates the period of oscillation as a function of the affinity between repressors and operators for models with two operators per gene (A, B) and three operators per gene (C, D). In (A) and (C), the affinities of a single set of operators are varied while the affinities of the remaining two sets are held constant at 10^10 M^-1. In (B) and (D), the affinities of all operator sites in all genes are symmetrically altered. The forward kinetic constant is always 10^8 (M s)^-1.
The investigation of variations in the forward rate constant revealed that while the forward rate was maintained at 10° (M s)~!, some repressors, such as the lac repressor, exhibit binding rates exceeding this value, although such cases are uncommon Models utilizing the conservative forward rate of 10° (M s)~! were analyzed across various affinities and symmetries, generally displaying poor oscillation capabilities When oscillations did occur, they exhibited longer periods compared to systems with higher forward rates but the same affinity For instance, a symmetric model with three operators per gene and an operator-repressor affinity of 10? M~! showed oscillation periods of 3.39 hours and 5.89 hours for forward rates of 10° (M s)~! and 10° (M s)~!, respectively This indicates that the forward binding rate of a repressor to its operator is as crucial as its affinity for effective oscillation behavior.
3.5.5 Conclusions
The modified ac-tet-ara system showcases how a gene network, formed from unrelated molecular components, can produce a consistently periodic protein product Notably, affordable simulations have facilitated the development of straightforward design rules that guide initial experimental cycles These design rules distill simulation outcomes into concise guidelines, simplifying the construction of robust oscillating gene networks.
The design parameters significantly influence the period of oscillation in gene regulation Specifically, a higher repressor-operator affinity extends the oscillation period, while increasing the number of operators for each gene enhances sensitivity to repressors, known as cooperativity However, if the affinities of the operators for three genes vary by more than two orders of magnitude, sustained oscillations become unattainable To achieve a specific oscillation period, one can adjust the number of operators and the repressor-operator affinity accordingly Utilizing a single operator typically leads to unsustained oscillations, and incorporating more than three overlapping operators poses challenges Additionally, extending the half-life of mRNA or repressor proteins can prolong the oscillation period, while the availability of RNA polymerases and ribosomes inversely affects oscillation dynamics.
The impact of protein and mRNA half-lives on oscillation periods is illustrated in Figure 3.21 In part (A), the half-lives of various mRNA species are symmetrically adjusted between 5 to 15 minutes In part (B), while two mRNA species maintain a fixed half-life of 5 minutes, the half-life of the third species is altered within this range.
The half-lives of protein species vary symmetrically between 10 to 60 minutes, with specific genes maintaining constant half-lives of 20 minutes for their protein products, while others fluctuate within the same range RNA polymerases compete with repressors for promoter binding, leading to an increase in RNA polymerase levels having a similar impact as reducing repressor-operator affinity Additionally, boosting ribosome numbers parallels the effect of extending repressor half-lives Although experimental manipulation of RNA polymerase and ribosome expression is challenging, understanding how metabolic shifts affect oscillation periods is crucial Notably, transitioning from exponential to stationary growth phases alters the oscillation period due to changes in RNA polymerase and ribosome quantities.
Constructing and testing many variant networks experimentally can be prohibitively expensive; we therefore use stochastic simulations of a detailed mechanistic model. While these simulations are based on several assumptions, they offer a framework of verifiable and falsifiable rules regarding the known interactions. The initial cycle of targeted experiments then allows the model to be refined directly, correcting incorrect assumptions or kinetic parameters. The detailed mechanistic approach simplifies the modification of the kinetic characteristics of specific molecular interactions, which can be directly measured through experiments. In contrast, models that rely heavily on coarse-grained or lumped interactions are harder to adapt, because they amalgamate multiple biological processes without fully accounting for their independent actions.
In the near future, toolboxes containing known DNA sequences and protein molecules with well-characterized kinetic parameters will be developed, with initial successes already reported. By integrating these molecular components into synthetic gene networks, we can create innovative and functional systems. However, as these networks grow in complexity, our intuitive predictions of their behavior will falter. Detailed mechanistic models and stochastic simulation techniques can then be employed to efficiently identify the essential molecular components and network connections needed to achieve a specific dynamic behavior.
The increasing variety of DNA binding sites, modular protein domains, and RNA secondary structures makes it challenging to intuitively assemble these components into a functional synthetic gene network with a specific phenotypic outcome. Developing mathematical models of synthetic gene networks allows efficient and cost-effective exploration of the possible combinations, helping to identify networks that achieve the desired behaviors. Future advances will rely on broadening the range of available building blocks and precisely measuring their molecular interactions.
In this chapter, we integrate experimental, computational, and theoretical methods to thoroughly investigate the expression of a synthetic promoter in response to varying environmental conditions. We design and analyze a DNA sequence, using a reliable toolbox of genetic components, aimed at producing the reporter Green Fluorescent Protein (GFP) only when two chemical inducers are present at high concentrations. The synthetic promoter thus functions analogously to an "AND" logic gate, ensuring precise control over gene expression.
We use genetic engineering techniques to synthesize the desired DNA sequence, insert it into a plasmid, and transform the plasmid into E. coli cells. Using flow-assisted cell sorting (FACS), we measure the expression rate from the synthetic promoter at sixteen inducer concentrations and seven time points. A systematic analysis of the molecular interactions among the promoter, proteins, and RNA leads to a steady-state mathematical model that is compared with the experimental data. This analysis allows us to calculate an unknown thermodynamic parameter and to evaluate our design objectives. Finally, we leverage the model's predictive capabilities to propose a new DNA sequence for the synthetic promoter with improved responsiveness to the inducers.
The synthetic promoter, illustrated in Figure 4.1, features two overlapping tetO2 operators from the tet operon and one overlapping lacO1 operator from the lac operon. Following the promoter is a Shine-Dalgarno (SD) ribosome binding site with a weak sequestering mRNA secondary structure. These regulatory elements drive the expression of the cycle3 gfp gene, which is 42 times brighter than the wild-type gfp while maintaining identical absorbance and emission spectra.
The gene (on a plasmid) is inserted into a strain of E. coli that constitutively expresses the lac and tet repressors.
Overview of Chapter 4
In this chapter, we first outline the molecular biology techniques used to synthesize the desired DNA sequence and insert it upstream of a promoter-less cycle3 gfp gene, resulting in a new plasmid. We use flow-assisted cell sorting (FACS) to characterize the expression dynamics of cells transformed with this plasmid. The experimental data reveal that the synthetic promoter functions as a fuzzy AND logic gate in response to the chemical inducers. A systematic analysis of the molecular interactions affecting the synthetic promoter's expression leads to a physical-chemical steady-state model. By comparing the experimental data with the model results, we identify the key thermodynamic parameter that accounts for the fuzzy AND behavior and calculate it with high accuracy. The model also allows us to propose a new synthetic promoter, with an improved DNA sequence, designed to behave more like a true AND logic gate. Our findings are summarized in the final section.
Synthesizing and Cloning the Construct
Two pairs of oligonucleotides, whose sequences are shown in Table 4.1, were designed and purchased from IDT DNA Technologies. The primary oligonucleotides are each 110 bp long and possess an overlapping complementary sequence.
Table 4.1: The primary and secondary pairs of oligonucleotides used in these experiments are shown.
The primary oligonucleotides share a 20 bp overlapping complementary sequence (5'-TCATGAACCGGTTTCCTTCT-3') for PCR amplification, yielding a 200 bp product that includes 118 bp of random DNA with 50% GC content and the 82 bp synthetic promoter. The secondary oligonucleotides correspond to the 5' end and to the reverse complement of the 3' end of the final PCR product. A primary PCR extension reaction was conducted in a 50 µL volume using Taq polymerase, dNTPs, buffer, water, and the pair of 110 bp oligonucleotides, with an initial denaturation at 94°C for 3 minutes followed by a one-hour extension at 72°C. This was followed by a secondary amplification, mixing 5 µL of the primary reaction with additional reagents and running a program of 3 minutes of initial denaturation at 94°C, 35 cycles of annealing at 53°C, extension at 72°C, and denaturation at 94°C, and a final extension at 72°C for 10 minutes. The resulting PCR products were purified by agarose gel electrophoresis, isolating fragments between 180 and 220 bp using a gel extraction kit from Qiagen.
We inserted the purified PCR product into the pGLOW TOPO plasmid (Invitrogen) following the manufacturer's instructions. The pGLOW plasmid carries an ampicillin resistance gene and a promoter-less cycle3 gfp gene, which is approximately 42 times brighter than the wild-type gfp while maintaining a similar excitation and emission spectrum. For clarity, we refer to the cycle3 gfp variant as GFP, encompassing both the gene and the protein it encodes. The plasmid's TOPO cloning sites accept the insertion of a promoter and ribosome binding site (RBS) sequence to drive GFP expression. The reaction mixture consisted of 1 µL TOPO plasmid, 1 µL salt solution, and 4 µL purified PCR product, and was let stand for 5 minutes at room temperature. The single-base A overhangs left by Taq polymerase on the PCR product complement the overhanging Ts at the TOPO cloning sites, allowing efficient ligation by the covalently attached topoisomerase. This reaction yields a pGLOW plasmid carrying our desired promoter and ribosome binding site (RBS) sequence to drive GFP expression.
We transformed chemically competent Top10 cells (Invitrogen) with the pGLOW plasmid by combining thawed cells with 3 µL of the TOPO reaction mixture. After a 30-minute incubation on ice, we applied a 90-second heat shock at 42°C, followed by a quick return to ice. The cells were then given 250 µL of SOC medium, gently shaken at 37°C for one hour, and spread onto two ampicillin agar plates (LB broth + agar + 200 mM ampicillin) at 200 µL and 50 µL volumes, respectively, so that at least one plate contained well-spaced colonies. The agar plates were incubated overnight at 37°C.
We isolated eight single colonies from the plates and cultured them overnight in 2 mL LB broth with ampicillin. After pelleting the cells, we lysed them and purified their plasmids using a Miniprep silica column kit (Qiagen). The promoters inserted in the plasmids were sequenced with primers specific to the pGLOW plasmid, allowing us to select a colony with an error-free promoter; this plasmid is referred to as LT1 pGLOW. We also identified a colony whose pGLOW plasmid contained a junk promoter and designated it as our negative control, pGLOW Neg.
Initial Confirmation of “AND”-like Promoter Activity
We transformed DH5α and DH5αPro chemically competent cells with the LT1 pGLOW and pGLOW Neg plasmids to evaluate the functionality of the promoter and RBS sequence. The DH5αPro strain, which expresses the Lac and Tet repressors, contained 3000 and 7000 repressor molecules per cell, respectively, while the DH5α strain lacks these repressors. This allowed us to confirm that DH5α with LT1 pGLOW expressed GFP, whereas DH5αPro with LT1 pGLOW did not. Both DH5α and DH5αPro cells with pGLOW Neg showed no GFP expression. All transformed strains were cultured overnight in LB + Amp at 37°C.
The DH5αPro LT1 pGLOW strain requires high concentrations of aTC and IPTG for significant GFP expression. We tested this on agar plates with four inducer combinations: 0 ng/mL aTC and 0 mM IPTG (-/-), 200 ng/mL aTC and 0 mM IPTG (+/-), 0 ng/mL aTC and 2 mM IPTG (-/+), and 200 ng/mL aTC and 2 mM IPTG (+/+). After incubation at 37°C, GFP production was negligible in the (-/-) condition and when only IPTG was present. Low levels of GFP were observed with aTC alone, increasing from 18 to 36 hours. In contrast, the combination of aTC and IPTG led to a large accumulation of GFP at both time points. The negative control, DH5αPro pGLOW Neg cells, did not express GFP under any condition.
Sampling Inducer-Dependent Expression over Time
We conducted a quantitative analysis of the GFP concentration in DH5αPro LT1 pGLOW cells using flow-assisted cell sorting (FACS), examining the effects of varying concentrations of aTC and IPTG over time. Our objective was to assess how these concentrations influence both the production rate and the steady-state concentration of GFP while keeping the cells in the exponential growth phase. Maintaining this phase is important because cell physiology, including sigma factor availability, changes significantly when cells enter the stationary phase of growth.
We began the experiment by culturing DH5αPro LT1 pGLOW cells in (-/-) media overnight. The following day, we inoculated sixteen different conditions with 10 µL of the DH5αPro LT1 pGLOW culture.
A Lac/Tet "AND" gate under four different induction conditions at 18 hours of induction
Figure 4.2: Agar plates streaked with DH5αPro E. coli cells containing the LT1 pGLOW plasmid, incubated at 37°C for 18 hours, are shown (A) without IPTG or aTC, (B) with 2 mM IPTG, (C) with 200 ng/mL aTC, and (D) with 2 mM IPTG and 200 ng/mL aTC. Writing with glowing bacteria is certainly fun!
A Lac/Tet "AND" gate under four different induction conditions at 36 hours of induction
Figure 4.3: Agar plates streaked with DH5αPro E. coli cells containing the LT1 pGLOW plasmid, incubated at 37°C for 36 hours, are shown (A) without IPTG or aTC, (B) with 2 mM IPTG, (C) with 200 ng/mL aTC, and (D) with 2 mM IPTG and 200 ng/mL aTC.
In our experiment, each culture contained 1 mL of Amp LB broth with one of four aTC concentrations (1, 10, 100, or 200 ng/mL) combined with one of four IPTG concentrations (0.01, 0.1, 1.0, or 2.0 mM). To dispense the inducers accurately, we prepared stock solutions of the different media combinations, which minimized measurement error when adding small quantities of inducer. Throughout the experiment, all cultures were kept in a gently shaking water bath at 37°C, with minimal interruptions for sampling or dilution.
At 3, 6.5, and 9 hours post-inoculation, we collected 200 µL samples from each culture for later FACS analysis. After each sampling, we diluted the cells 1:10 by removing an additional 700 µL of culture and replacing the withdrawn volume with 900 µL of fresh medium containing the same inducer concentrations.
From 9 to 26 hours the cultures were allowed to grow overnight without further dilution. At the 26-hour mark, we extracted 200 µL from each culture for FACS analysis and transferred the remaining cultures to inducer-free media: the cells were spun down into pellets, washed with sterile PBS, and resuspended in 1 mL PBS. From the sixteen suspensions, we inoculated sixteen fresh tubes, each containing 1 mL of (-/-) Amp LB broth. We then collected 100 µL samples from these cultures at 3, 4, and 7 hours post-transfer for further FACS analysis, for a total of 112 samples. Each sample was fixed by pelleting, washing in PBS, resuspending in 4% paraformaldehyde for 30 minutes, and washing again in PBS before a final resuspension.
Characterization of Samples with FACS
The flow-assisted cell sorter (FACS) measures the distribution of fluorescence intensity across a cell population, allowing a distinct population of E. coli cells to be identified, separate from debris or cellular particles. During the exponential growth phase, little debris and few dead cells were observed. A consistent, large gate was drawn around the E. coli population and kept throughout the measurements, and samples were diluted at least 1:1000 into cytometry tubes. Fluorescence intensity was measured for 10^6 E. coli cells, keeping the acquisition rate under 2000 cells per second to avoid erroneously summing the fluorescence of multiple cells. Data files containing the GFP fluorescence for all time points and inducer concentrations were exported for analysis and plotting in MATLAB (Mathworks).
This section examines the impact of the sixteen combinations of aTC and IPTG concentrations on the expression activity of the synthetic promoter, which is regulated by the lac and tet repressors. The synthetic promoter exhibits a Boolean "AND"-like behavior: maximal GFP expression occurs only at high concentrations of both inducers (100 ng/mL aTC and 1 mM IPTG or higher), while lower concentrations of either inducer result in diminished expression.
Figure 4.4: The effect of increasing IPTG concentration on the GFP fluorescence probability distribution at three time points: (A) 3 hours, (B) 6.5 hours, and (C) 9 hours post-inoculation. The IPTG concentrations are 0.01 mM (green), 0.1 mM (blue), 1 mM (red), and 2 mM (black), with the aTC concentration held constant at 1 ng/mL.
Figure 4.5: The effect of increasing aTC concentration on the GFP fluorescence probability distribution at (A) 3 hours, (B) 6.5 hours, and (C) 9 hours post-inoculation. The aTC concentrations are 1 ng/mL (green), 10 ng/mL (blue), 100 ng/mL (red), and 200 ng/mL (black), with IPTG held constant at 0.01 mM. Notably, at aTC concentrations of 10 ng/mL or higher there is significant "leaky" expression, whereas with IPTG alone GFP expression remains below the detection threshold set by cellular autofluorescence.
The “ON” Dynamics of the Synthetic Promoter Expression
We assessed the rise time required for the system to reach steady state, where the production rate of GFP matches its dilution rate. This rise time depends on the concentrations of aTC and IPTG, the transport rate of the inducers across the plasma membrane, and the maximum production rate of GFP. The distribution of GFP fluorescence was examined for each combination of aTC and IPTG concentrations.
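As a rough, hedged estimate of the slowest rise time set by dilution and degradation alone (ignoring inducer transport), a first-order production-loss balance using the cell doubling time of 1.5 hours and the GFP half-life of 6 hours quoted later in Table 4.3 gives a 95% rise time of roughly five hours; the short sketch below carries out this arithmetic.

```python
import math

# First-order approach to steady state for GFP, assuming a constant production
# rate and loss by dilution (growth) plus degradation:
#   dG/dt = k_prod - (mu + delta_GFP) * G
# so the level approaches steady state as 1 - exp(-(mu + delta_GFP) * t).

doubling_time_h = 1.5           # cell doubling time
gfp_half_life_h = 6.0           # GFP half-life

mu = math.log(2) / doubling_time_h         # dilution rate, 1/h
delta_gfp = math.log(2) / gfp_half_life_h  # degradation rate, 1/h
k_loss = mu + delta_gfp

t95 = -math.log(0.05) / k_loss             # time to reach 95% of steady state
print(f"effective loss rate = {k_loss:.3f} 1/h, 95% rise time = {t95:.1f} h")
```

The resulting estimate (about 5 hours) is consistent with the observed approach to steady state within roughly 6.5 to 9 hours, with the remaining delay attributable to inducer transport.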
Figure 4.6: The effect of simultaneously increasing the aTC and IPTG concentrations on the GFP fluorescence probability distribution at three time points: (A) 3 hours, (B) 6.5 hours, and (C) 9 hours post-inoculation. The inducer combinations are 1 ng/mL aTC with 0.01 mM IPTG (green), 10 ng/mL aTC with 0.1 mM IPTG (red), 100 ng/mL aTC with 1 mM IPTG (blue), and 200 ng/mL aTC with 2 mM IPTG (black).
The production rate of GFP depends on the concentrations of both inducers, and the fluorescence distributions stabilize to a constant, time-independent level within approximately 9 hours, as illustrated in Figures 4.4, 4.5, and 4.6.
At a low aTC concentration (1 ng/mL), varying the IPTG concentration does not influence promoter activity, as shown in Fig. 4.4. The observed GFP fluorescence is primarily cellular autofluorescence, with values ranging from zero to 200 fluorescence units (flu). Cells with fluorescence below 200 flu harbor a synthetic promoter that has not been induced by either aTC or IPTG to produce GFP. Conversely, when the IPTG concentration is low (0.01 mM), increasing the aTC level significantly enhances promoter activity.
The initial production rate of GFP increases sigmoidally with aTC concentration over the range 1 to 200 ng/mL, producing a progressively faster rightward shift of the distribution. This indicates that the synthetic promoter increases GFP production in response to aTC, whereas IPTG alone does not effectively induce GFP expression.
When both aTC and IPTG are added to the system, the production rate of GFP is again sigmoidal and depends on the concentrations of both inducers. At moderate levels of aTC and IPTG the production dynamics change noticeably: the initial production rate of GFP with 10 ng/mL aTC and 0.1 mM IPTG is significantly greater than with 10 ng/mL aTC alone, reaching approximately half of the maximum production rate. At 100 ng/mL aTC and 1 mM IPTG or higher, the GFP production rate reaches its peak, and the system reaches steady-state conditions within at most 6.5 hours.
Inducer molecules such as aTC and IPTG need time to traverse the periplasmic and plasma membranes of bacterial cells before binding their respective repressors and activating the synthetic promoter. The transport of these molecules is inherently stochastic, so intracellular concentrations of aTC and IPTG vary from cell to cell. IPTG, in particular, is slow to cross the plasma membrane, taking approximately four to six hours to reach chemical equilibrium, which can delay the system's response.
Figure 4.7 illustrates the average GFP fluorescence at the sixteen combinations of aTC and IPTG concentrations at 3 hours, 6.5 hours, and 9 hours post-inoculation. There remains a population of cells that have not been induced by either aTC or IPTG, and the effect is more pronounced for IPTG because of its slower transport dynamics. The sigmoidal relationship between the GFP production rate and the aTC/IPTG concentrations gives rise to two distinct plateaus in GFP fluorescence over time. The low plateau arises when aTC alone enhances GFP production at a sub-maximal rate, while the high plateau occurs when both aTC and IPTG induce GFP production at its peak level. A closer analysis of the steady-state distributions of GFP fluorescence at the 9-hour mark clarifies the origin of these two plateaus.
The steady-state distributions of GFP fluorescence at 9 hours, influenced by different concentrations of aTC and IPTG, are illustrated in Figure 4.8, with the means of these distributions highlighted by red circles.
The Steady-State Distribution of GFP Fluorescence over Varying Inducer Concentrations
The steady-state distributions of GFP fluorescence across the sixteen combinations of aTC and IPTG concentrations shift to the right and become more Gaussian as the concentrations increase. At low concentrations of both aTC and IPTG, the distribution is markedly non-Gaussian, with a mean of 86 flu, a standard deviation of 88 flu, and a skewness of 4.9, indicating a heavy right tail. Increasing the IPTG concentration alone has little effect on the steady-state distribution, whereas raising the aTC concentration increases the mean from 88 to 394 flu. The most Gaussian distribution occurs at 200 ng/mL aTC and 0.01 mM IPTG, with a standard deviation of 82 flu and a skewness of 0.12.
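The distribution statistics quoted here (mean, standard deviation, and skewness) can be computed directly from the exported event-level fluorescence values. The sketch below shows one way to do so; the synthetic right-skewed sample merely stands in for an exported FACS data file.

```python
import numpy as np
from scipy.stats import skew

def summarize_fluorescence(flu):
    """Mean, sample standard deviation, and skewness of a fluorescence sample."""
    flu = np.asarray(flu, dtype=float)
    return flu.mean(), flu.std(ddof=1), skew(flu)

# Placeholder for exported FACS events: a right-skewed (lognormal) sample.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=4.0, sigma=0.8, size=100_000)
m, s, g1 = summarize_fluorescence(sample)
print(f"mean = {m:.1f} flu, std = {s:.1f} flu, skewness = {g1:.2f}")
```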
Cell Division Rates over Varying Inducer Concentration
The Participating Molecular Interactions
The kinetic and thermodynamic parameters describing the following molecular interactions are listed in Table 4.3.
The synthetic promoter's annotated sequence and the molecular interactions controlling transcriptional initiation are depicted in Figure 4.9A. The Lac repressor tetramer binds the overlapping lacO1 operator with a Gibbs free energy ΔG_lac-DNA, while the Tet repressor dimer can bind either of the two tetO2 operators with a Gibbs free energy ΔG_tet-DNA. These DNA-binding proteins, the repressors, inhibit the RNA polymerase:σ factor complex (the Holoenzyme) from associating with the promoter's -35 and -10 hexamer sequences. The strength of this steric exclusion is determined by the repressor proteins and by the spatial arrangement of the operator and Holoenzyme binding sites. Specifically, the interaction between a Lac tetramer at the upstream lacO1 operator and the Holoenzyme is quantified by a positive Gibbs free energy, ΔG_O1-RNAP, while the interactions of a Tet dimer at the spacer and downstream tetO2 operators are quantified by ΔG_O2-RNAP and ΔG_O3-RNAP. These positive Gibbs free energies arise from van der Waals interactions as the two molecules compete for the same space.
As shown in Figure 4.9B and C, the binding of the Lac tetramer and Tet dimer repressors to their inducers significantly decreases their affinity for the DNA operators, with equilibrium association constants K_IPTG^Lac and K_aTC^Tet. When two IPTG molecules bind the Lac tetramer, the Gibbs free energy of operator binding increases from -14.5 to -10.9 kcal/mol, while the binding of one aTC molecule to the Tet dimer raises the Gibbs free energy from -15 to -11 kcal/mol. The repressors also bind non-specifically to genomic DNA, with a Gibbs free energy of approximately -7.2 kcal/mol, creating a competition between the specific operator sites and genomic DNA. The operator sites are few but have a more negative Gibbs free energy, whereas the non-specific sites on genomic DNA are numerous but less negative. Binding of aTC or IPTG to its repressor shifts this competition towards the non-specific sites, distributing the repressors across the genomic DNA; the inducers accomplish this by causing conformational changes in the DNA-bound repressors that make their operator-binding Gibbs free energies more positive. Finally, the Holoenzyme binds the -35 and -10 hexamer sequences of the promoter and, once assembled, initiates transcription in a first-order reaction characterized by a kinetic constant, k_init.
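As a quick check on these numbers, the quoted binding free energies can be converted into equilibrium association constants through K = exp(-ΔG/RT) at 310 K. The sketch below is illustrative only; the standard-state convention (and therefore the absolute units of K) is left implicit.

```python
import math

R = 1.987e-3   # gas constant, kcal/(mol K)
T = 310.0      # physiological temperature, K

def K_assoc(dG_kcal_per_mol):
    """Equilibrium association constant corresponding to a binding free energy."""
    return math.exp(-dG_kcal_per_mol / (R * T))

# Binding free energies quoted in the text (kcal/mol)
interactions = {
    "LacI4 : lacO1":      -14.5,
    "TetR2 : tetO2":      -15.0,
    "LacI:IPTG : lacO1":  -10.9,
    "TetR:aTC : tetO2":   -11.0,
    "non-specific DNA":    -7.2,
}
for name, dG in interactions.items():
    print(f"{name:18s} dG = {dG:6.1f} kcal/mol  ->  K = {K_assoc(dG):.2e}")
```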
Figure 4.10 illustrates the molecular interactions that influence translation and GFP protein production. The 30S ribosomal subunit attaches to the ribosome binding site (RBS) at the 5' end of the mRNA, where a complementary sequence often exists with the 3' end of the 16S rRNA. This RNA:RNA duplex stabilizes the rRNA on the mRNA, facilitating ribosome assembly. A stronger RNA:RNA duplex, as indicated by its Gibbs free energy (ΔG), can accelerate ribosome assembly up to a limit; conversely, any factor that hinders the formation of this duplex will impede ribosome assembly.
[Figure 4.9 appears here: the annotated synthetic promoter sequence (lacO1, the two tetO2 operators, the -35 and -10 hexamers, the RBS, and the start codon) together with the repressor-inducer binding reactions, labeled with the Gibbs free energies ΔG_lac-DNA, ΔG_tet-DNA, ΔG_lac:IPTG-DNA, ΔG_tet:aTC-DNA, and the steric terms ΔG_O1-RNAP, ΔG_O2-RNAP, and ΔG_O3-RNAP.]
Figure 4.9: (A) The synthetic promoter sequence, showing the lacO1 and tetO2 DNA operators, the -35 and -10 hexamer sequences, and the ribosome binding site (RBS). The Gibbs free energies of the interactions between each repressor and its operator (ΔG_lac-DNA and ΔG_tet-DNA) and the steric interactions between DNA-bound repressors and the Holoenzyme (ΔG_O1-RNAP, ΔG_O2-RNAP, and ΔG_O3-RNAP) are indicated. (B, C) The Lac repressor tetramer can bind up to four molecules of IPTG with an equilibrium association constant, K_IPTG^Lac, resulting in a more positive Gibbs free energy between the repressor and its operator; the Tet repressor dimer can bind up to two molecules of aTC, likewise leading to a more positive Gibbs free energy between the repressor and its operator. The Holoenzyme binds the -35 and -10 hexamer sequences of the promoter with a specific Gibbs free energy and, once assembled, initiates transcription as a first-order reaction defined by the kinetic constant k_init.
[Figure 4.10 appears here: a schematic of translation initiation, showing the SD:anti-SD (16S rRNA) duplex with ΔG = -13.7 kcal/mol and the RBS-sequestering mRNA secondary structure with ΔG_folding = -2.5 kcal/mol.]
Figure 4.10: The mechanistic steps of translation initiation. The 30S ribosomal subunit binds the 5' end of the mRNA at the ribosome binding site (RBS), which typically contains a Shine-Dalgarno sequence highly complementary to the 3' end of the 16S rRNA of the 30S subunit. The strength of this complementarity can be assessed by calculating the Gibbs free energy of binding between the two RNA fragments. The mRNA may also form secondary structures, such as hairpins, that obstruct ribosomal binding to the RBS; the stability of these structures determines how often the RBS is in an unfolded state and is quantified by a Gibbs free energy of folding, ΔG_folding, calculated with RNA-folding programs such as UNAFold. Once the 30S subunit has bound, the ribosome fully assembles and initiates translation, leading to the folding and maturation of the reporter GFP protein.

The mRNA secondary structure at the 5' end sequesters the ribosome binding site (RBS), preventing it from pairing with the 16S rRNA in E. coli: the hairpin incorporates the RBS sequence and thereby hinders translation initiation. Because the kinetics of RNA folding and unfolding are fast, the 5' end of the mRNA is assumed to be in chemical equilibrium, giving the concentration of mRNA in its unfolded state as
$$[\text{mRNA}]_{\text{unfolded}} = \frac{[\text{mRNA}]}{1 + \exp\left(-\dfrac{\Delta G_{\text{mRNA}}}{RT}\right)} \qquad (4.1)$$

where ΔG_mRNA is the Gibbs free energy of the mRNA secondary structure that encompasses the RBS. Once the ribosome assembles on the ribosome binding site, it initiates translation in a first-order reaction with a kinetic constant k_trans.
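A minimal sketch of evaluating Eq. (4.1) is given below, using the value ΔG_mRNA = -2.5 kcal/mol from Table 4.3; the function simply computes the equilibrium fraction of transcripts whose RBS is unfolded.

```python
import math

R = 1.987e-3   # kcal/(mol K)
T = 310.0      # K

def fraction_unfolded(dG_mRNA):
    """Equilibrium fraction of transcripts whose RBS is not sequestered, Eq. (4.1)."""
    return 1.0 / (1.0 + math.exp(-dG_mRNA / (R * T)))

print(f"unfolded fraction at dG = -2.5 kcal/mol: {fraction_unfolded(-2.5):.3f}")
```

With ΔG_mRNA = -2.5 kcal/mol this gives a few percent of transcripts with an exposed RBS, consistent with the intended weak sequestering structure.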
The Steady-State Governing Equations
We now derive the steady-state equations that relate the average GFP fluorescence to the concentrations of the IPTG and aTC inducers.
The Lac tetramer and Tet dimer repressors bind the IPTG and aTC inducers with equilibrium association constants K_IPTG^Lac and K_aTC^Tet, respectively. The average steady-state concentrations of the Lac and Tet repressors, both free and bound to their respective inducers, are therefore
$$[\text{LacI}_4] = \frac{[\text{LacI}_4]_{\text{total}}}{1 + K_{IPTG}^{Lac}\,[\text{IPTG}]^2}, \qquad [\text{LacI{:}IPTG}_4] = [\text{LacI}_4]_{\text{total}} - [\text{LacI}_4]$$

$$[\text{TetR}_2] = \frac{[\text{TetR}_2]_{\text{total}}}{1 + K_{aTC}^{Tet}\,[\text{aTC}]}, \qquad [\text{TetR{:}aTC}_2] = [\text{TetR}_2]_{\text{total}} - [\text{TetR}_2] \qquad (4.2)$$
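The sketch below evaluates the free-repressor fractions implied by the reconstruction of Eq. (4.2) above, which assumes two effective IPTG molecules per Lac tetramer and one aTC molecule per Tet dimer, as suggested by the text. The association constants K_IPTG^Lac and K_aTC^Tet are placeholders, since their numerical values are not reproduced in this section.

```python
def free_fraction_lac(iptg_M, K_iptg_lac):
    """Fraction of Lac tetramers not bound by IPTG (two effective IPTG per tetramer)."""
    return 1.0 / (1.0 + K_iptg_lac * iptg_M**2)

def free_fraction_tet(atc_M, K_atc_tet):
    """Fraction of Tet dimers not bound by aTC (one effective aTC per dimer)."""
    return 1.0 / (1.0 + K_atc_tet * atc_M)

# Placeholder association constants, units consistent with molar concentrations.
K_iptg, K_atc = 1.0e6, 1.0e8
for iptg_mM in (0.0, 0.01, 0.1, 1.0, 2.0):
    frac = free_fraction_lac(iptg_mM * 1e-3, K_iptg)   # convert mM to M
    print(f"IPTG = {iptg_mM:5.2f} mM -> free LacI4 fraction = {frac:.3f}")
```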
After the assembly of the Holoenzyme on the promoter, all molecular interactions can be summarized by a single constant that remains unaffected by the concentrations of the aTC and IPTG inducers.
These remaining factors are independent of the aTC and IPTG concentrations: the kinetics of transcriptional initiation after Holoenzyme assembly (k_init), the kinetics of translational initiation (k_trans), the plasmid copy number (N_plasmid), the degradation and dilution of the gfp mRNA (δ_mRNA + δ_dil), the degradation and dilution of the GFP protein (δ_GFP + δ_dil), the formation of the mRNA secondary structure with Gibbs free energy ΔG_mRNA that sequesters the ribosome binding site, and an unknown factor, FL_GFP, relating fluorescence units (flu) to GFP concentration.
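As an illustration of how these inducer-independent factors could be lumped together, the sketch below combines the Table 4.3 values under simple first-order balances for mRNA and protein; multiplying the result by the unknown factor FL_GFP would give the constant Ω in fluorescence units. This grouping is an assumption made for illustration, not the exact bookkeeping used in the model.

```python
import math

# Inducer-independent parameters from Table 4.3
k_init    = 0.005                    # transcription initiation, transcripts/s
k_trans   = 6.7035                   # translation initiation, proteins/transcript/s
N_plasmid = 75                       # plasmid copies per cell
dG_mRNA   = -2.5                     # kcal/mol, RBS-sequestering mRNA structure
R, T      = 1.987e-3, 310.0          # kcal/(mol K), K

mu     = math.log(2) / (1.5 * 3600)  # dilution rate from the 1.5 h doubling time, 1/s
d_mRNA = math.log(2) / (10 * 60)     # mRNA half-life of 10 minutes, 1/s
d_GFP  = math.log(2) / (6 * 3600)    # GFP half-life of 6 hours, 1/s

f_unfolded = 1.0 / (1.0 + math.exp(-dG_mRNA / (R * T)))   # Eq. (4.1)

# Steady-state mRNA and GFP per cell when the promoter is always
# "transcriptionally ready" (P_init = 1). Multiplying GFP_ss by FL_GFP would
# give the lumped constant Omega in fluorescence units.
mRNA_ss = k_init * N_plasmid / (d_mRNA + mu)
GFP_ss  = k_trans * f_unfolded * mRNA_ss / (d_GFP + mu)
print(f"steady-state mRNA ~ {mRNA_ss:.0f} transcripts, GFP ~ {GFP_ss:.0f} molecules per cell")
```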
Table 4.3: Baseline values of kinetic/thermodynamic parameters.
Freely available RNA polymerase, RNAP
Total Lac repressor tetramers, LacI4
Total Tet repressor dimers, TetR2
IPTG: from 0 to 2 mM (0 to 1,204,400 molecules per cell)
aTC: from 0 to 200 ng/mL (0 to 271 molecules per cell)
ΔG_lac-DNA: -14.5 kcal/mol [180]
ΔG_tet-DNA: -15.0 kcal/mol [182]
ΔG_lac:IPTG-DNA: -10.9 kcal/mol [179]
ΔG_tet:aTC-DNA: -11.0 kcal/mol [182]
ΔG_NS (non-specific binding): -7.2 kcal/mol [263]
k_init: 0.005 transcripts/sec
k_trans: 6.7035 proteins/transcript/sec
ΔG_mRNA: -2.5 kcal/mol
N_plasmid: 75 copies
Protein and mRNA Degradation and Dilution
Cell doubling time: 1.5 hours
mRNA half-life, δ_mRNA: 10 minutes
GFP half-life, δ_GFP: 6 hours
Multiplicative factor, Ω × P_init: 463.7 flu at maximum Holoenzyme occupancy
Note: These parameters are lumped together to create the final inducer-independent multiplicative constant. flu: fluorescence units.
The Overall Steady-State Equation
The steady-state GFP fluorescence can be expressed as a simple algebraic equation with three terms, only one of which depends on the concentrations of aTC and IPTG. The first term is the probability of successful Holoenzyme assembly on the promoter, P_init, calculated from the thermodynamic Gibbs free energies via the chemical partition function; it depends on the concentrations of free and inducer-bound repressor given by Eq. (4.2). The second term is an inducer-independent multiplicative constant, Ω, which quantifies the GFP fluorescence produced when the Holoenzyme is always assembled at the promoter. The third term is an inducer-independent additive constant, C_background ≈ 86 flu, representing the background autofluorescence of the cells. Together, these terms give the steady-state GFP fluorescence,

$$\langle\text{GFP fluorescence}\rangle = \Omega\, P_{init} + C_{background}. \qquad (4.4)$$
Only the molecular interactions involved in Holoenzyme assembly depend on the inducer concentrations; the interactions in the subsequent, post-assembly processes do not. This distinction allows the experimental data to be aligned precisely with the model results, as discussed later.
The probability of the Holoenzyme having successfully assembled at the promoter is called the probability of the "transcriptionally ready state," P_init. It is determined by assuming that the protein-DNA interactions at the promoter are in chemical equilibrium. We enumerate all possible regulatory states and calculate their probabilities using a canonical-like partition function; the probability of the promoter existing in the "transcriptionally ready state" is the sum of the probabilities of the regulatory states in which the Holoenzyme is bound to the promoter.
The probability of the promoter existing in its i-th regulatory state is calculated as

$$P_i = \frac{h_i \exp\left(-\dfrac{\Delta G_i^{\,tot}}{RT}\right)}{\sum_j h_j \exp\left(-\dfrac{\Delta G_j^{\,tot}}{RT}\right)}$$

where ΔG_i^tot is the total Gibbs free energy and h_i the density of energy-equivalent microstates of the i-th regulatory state, R is the gas constant, and T is the physiological temperature of 37°C (310 K). The total Gibbs free energy of a state is the cumulative sum of the Gibbs free energies of the individual molecular interactions present in that state. The density of microstates is calculated from the concentrations of the participating species,

$$h_i = [\text{LacI}_4]^{\,n}\,[\text{LacI{:}IPTG}_4]^{\,m}\,[\text{TetR}_2]^{\,p}\,[\text{TetR{:}aTC}_2]^{\,q}\,[\text{RNAP}]^{\,r}$$

where the exponents count how many molecules of each species participate in the state.
Table 4.4: All 55 unique regulatory states of the synthetic promoter are shown with their corresponding Gibbs free energies and density of microstates.

i | O1 | O2 | O3 | P | ΔG_i^tot

1 | — | — | — | — | 0
2 | Lac | — | — | — | ΔG_lac-DNA
3 | Lac:IPTG | — | — | — | ΔG_lac:IPTG-DNA
4 | — | Tet | — | — | ΔG_tet-DNA
5 | — | Tet:aTC | — | — | ΔG_tet:aTC-DNA
6 | — | — | Tet | — | ΔG_tet-DNA
7 | — | — | Tet:aTC | — | ΔG_tet:aTC-DNA
8 | Lac | Tet | — | — | ΔG_lac-DNA + ΔG_tet-DNA
9 | Lac | Tet:aTC | — | — | ΔG_lac-DNA + ΔG_tet:aTC-DNA
10 | Lac | — | Tet | — | ΔG_lac-DNA + ΔG_tet-DNA
11 | Lac | — | Tet:aTC | — | ΔG_lac-DNA + ΔG_tet:aTC-DNA
12 | Lac:IPTG | Tet | — | — | ΔG_lac:IPTG-DNA + ΔG_tet-DNA
13 | Lac:IPTG | Tet:aTC | — | — | ΔG_lac:IPTG-DNA + ΔG_tet:aTC-DNA
14 | Lac:IPTG | — | Tet | — | ΔG_lac:IPTG-DNA + ΔG_tet-DNA
15 | Lac:IPTG | — | Tet:aTC | — | ΔG_lac:IPTG-DNA + ΔG_tet:aTC-DNA
16 | — | Tet | Tet | — | 2 ΔG_tet-DNA
17 | — | Tet:aTC | Tet | — | ΔG_tet:aTC-DNA + ΔG_tet-DNA
18 | — | Tet | Tet:aTC | — | ΔG_tet-DNA + ΔG_tet:aTC-DNA
19 | — | Tet:aTC | Tet:aTC | — | 2 ΔG_tet:aTC-DNA
20 | Lac | Tet | Tet | — | ΔG_lac-DNA + 2 ΔG_tet-DNA
21 | Lac | Tet:aTC | Tet | — | ΔG_lac-DNA + ΔG_tet-DNA + ΔG_tet:aTC-DNA
22 | Lac | Tet | Tet:aTC | — | ΔG_lac-DNA + ΔG_tet:aTC-DNA + ΔG_tet-DNA
23 | Lac | Tet:aTC | Tet:aTC | — | ΔG_lac-DNA + 2 ΔG_tet:aTC-DNA
24 | Lac:IPTG | Tet | Tet | — | ΔG_lac:IPTG-DNA + 2 ΔG_tet-DNA
25 | Lac:IPTG | Tet:aTC | Tet | — | ΔG_lac:IPTG-DNA + ΔG_tet-DNA + ΔG_tet:aTC-DNA
26 | Lac:IPTG | Tet | Tet:aTC | — | ΔG_lac:IPTG-DNA + ΔG_tet:aTC-DNA + ΔG_tet-DNA
27 | Lac:IPTG | Tet:aTC | Tet:aTC | — | ΔG_lac:IPTG-DNA + 2 ΔG_tet:aTC-DNA

Holoenzyme Bound
28 | — | — | — | RNAP/σ | ΔG_RNAP-DNA

Holoenzyme and One Repressor Bound
29 | Lac | — | — | RNAP/σ | ΔG_lac-DNA + ΔG_RNAP-DNA
30 | Lac:IPTG | — | — | RNAP/σ | ΔG_lac:IPTG-DNA + ΔG_RNAP-DNA
31 | — | Tet | — | RNAP/σ | ΔG_tet-DNA + ΔG_RNAP-DNA
32 | — | Tet:aTC | — | RNAP/σ | ΔG_tet:aTC-DNA + ΔG_RNAP-DNA
33 | — | — | Tet | RNAP/σ | ΔG_tet-DNA + ΔG_RNAP-DNA
34 | — | — | Tet:aTC | RNAP/σ | ΔG_tet:aTC-DNA + ΔG_RNAP-DNA
Table 4.5: (Continued) All 55 unique regulatory states of the synthetic promoter are shown with their corresponding Gibbs free energies and density of microstates.

i | O1 | O2 | O3 | P | ΔG_i^tot

Holoenzyme and Two Repressors Bound
35 | Lac | Tet | — | RNAP/σ | ΔG_lac-DNA + ΔG_tet-DNA + ΔG_RNAP-DNA
36 | Lac | Tet:aTC | — | RNAP/σ | ΔG_lac-DNA + ΔG_tet:aTC-DNA + ΔG_RNAP-DNA
37 | Lac | — | Tet | RNAP/σ | ΔG_lac-DNA + ΔG_tet-DNA + ΔG_RNAP-DNA
38 | Lac | — | Tet:aTC | RNAP/σ | ΔG_lac-DNA + ΔG_tet:aTC-DNA + ΔG_RNAP-DNA
39 | Lac:IPTG | Tet | — | RNAP/σ | ΔG_lac:IPTG-DNA + ΔG_tet-DNA + ΔG_RNAP-DNA
40 | Lac:IPTG | Tet:aTC | — | RNAP/σ | ΔG_lac:IPTG-DNA + ΔG_tet:aTC-DNA + ΔG_RNAP-DNA
41 | Lac:IPTG | — | Tet | RNAP/σ | ΔG_lac:IPTG-DNA + ΔG_tet-DNA + ΔG_RNAP-DNA
42 | Lac:IPTG | — | Tet:aTC | RNAP/σ | ΔG_lac:IPTG-DNA + ΔG_tet:aTC-DNA + ΔG_RNAP-DNA
43 | — | Tet | Tet | RNAP/σ | 2 ΔG_tet-DNA + ΔG_RNAP-DNA
44 | — | Tet:aTC | Tet | RNAP/σ | ΔG_tet:aTC-DNA + ΔG_tet-DNA + ΔG_RNAP-DNA
45 | — | Tet | Tet:aTC | RNAP/σ | ΔG_tet-DNA + ΔG_tet:aTC-DNA + ΔG_RNAP-DNA
46 | — | Tet:aTC | Tet:aTC | RNAP/σ | 2 ΔG_tet:aTC-DNA + ΔG_RNAP-DNA

Holoenzyme and Three Repressors Bound
47 | Lac | Tet | Tet | RNAP/σ | ΔG_lac-DNA + 2 ΔG_tet-DNA + ΔG_RNAP-DNA
48 | Lac | Tet:aTC | Tet | RNAP/σ | ΔG_lac-DNA + ΔG_tet-DNA + ΔG_tet:aTC-DNA + ΔG_RNAP-DNA
49 | Lac | Tet | Tet:aTC | RNAP/σ | ΔG_lac-DNA + ΔG_tet:aTC-DNA + ΔG_tet-DNA + ΔG_RNAP-DNA
50 | Lac | Tet:aTC | Tet:aTC | RNAP/σ | ΔG_lac-DNA + 2 ΔG_tet:aTC-DNA + ΔG_RNAP-DNA
51 | Lac:IPTG | Tet | Tet | RNAP/σ | ΔG_lac:IPTG-DNA + 2 ΔG_tet-DNA + ΔG_RNAP-DNA
52 | Lac:IPTG | Tet:aTC | Tet | RNAP/σ | ΔG_lac:IPTG-DNA + ΔG_tet-DNA + ΔG_tet:aTC-DNA + ΔG_RNAP-DNA
53 | Lac:IPTG | Tet | Tet:aTC | RNAP/σ | ΔG_lac:IPTG-DNA + ΔG_tet:aTC-DNA + ΔG_tet-DNA + ΔG_RNAP-DNA
54 | Lac:IPTG | Tet:aTC | Tet:aTC | RNAP/σ | ΔG_lac:IPTG-DNA + 2 ΔG_tet:aTC-DNA + ΔG_RNAP-DNA

Repressor Non-Specific Binding
55 | — | — | — | — | ΔG_NS
The density of microstates for the state in which the repressors are non-specifically bound to genomic DNA is h_55 = [LacI4] × [LacI:IPTG4] × [TetR2] × [TetR:aTC2], where the concentrations are those of the species participating in the regulatory state. This formula, akin to the mass action rate law, quantifies the probability of the simultaneous binding of multiple molecules at the promoter. The same construction is applied to each of the 55 distinct regulatory states of the synthetic promoter detailed in Tables 4.4 and 4.5, which enumerate all combinations of free and inducer-bound Lac and Tet repressors together with the binding states of the Holoenzyme at the promoter.
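A minimal sketch of the resulting partition-function calculation is given below. It enumerates the 54 specific-binding states combinatorially (the non-specific state 55 is omitted for brevity), applies the steric penalties ΔG_Oi-RNAP to states in which a repressor and the Holoenzyme are bound simultaneously, as described in the text, and returns P_init as the summed weight of the Holoenzyme-bound states. The value used for ΔG_RNAP-DNA and the species concentrations are placeholders, since they are not given in this section.

```python
import math
from itertools import product

R, T = 1.987e-3, 310.0   # kcal/(mol K), K

# Binding free energies in kcal/mol; dG for RNAP-DNA is a placeholder value.
dG = {"lac": -14.5, "lac_iptg": -10.9, "tet": -15.0, "tet_atc": -11.0,
      "rnap": -12.5,
      "steric_O1": 0.52, "steric_O2": 1.5, "steric_O3": 1.0}

def p_init(conc):
    """Probability of the 'transcriptionally ready' (Holoenzyme-bound) states.

    conc maps species names ('lac', 'lac_iptg', 'tet', 'tet_atc', 'rnap') to
    the concentrations used for the density of microstates h_i."""
    Z, Z_ready = 0.0, 0.0
    for o1, o2, o3, rnap in product([None, "lac", "lac_iptg"],
                                    [None, "tet", "tet_atc"],
                                    [None, "tet", "tet_atc"],
                                    [False, True]):
        dG_tot, h = 0.0, 1.0
        for occ, steric in ((o1, "steric_O1"), (o2, "steric_O2"), (o3, "steric_O3")):
            if occ is not None:
                dG_tot += dG[occ]
                h *= conc[occ]
                if rnap:
                    dG_tot += dG[steric]   # repressor/Holoenzyme steric penalty
        if rnap:
            dG_tot += dG["rnap"]
            h *= conc["rnap"]
        w = h * math.exp(-dG_tot / (R * T))
        Z += w
        if rnap:
            Z_ready += w
    return Z_ready / Z

# Example call with nominal species concentrations (arbitrary units)
print(p_init({"lac": 1.0, "lac_iptg": 0.0, "tet": 1.0, "tet_atc": 0.0, "rnap": 1.0}))
```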
4.5 Combining Experimental and Model Results
Calculating an Unknown Parameter
Using the base kinetic and thermodynamic parameters in Table 4.3, we solve Eq. (4.4) over the sixteen different concentrations of aTC and IPTG inducer (from 0.001 to 2 mM IPTG and from 1 to 200 ng/mL aTC) and compare the result to the measured steady-state average GFP fluorescence; the model solution is shown in Figure 4.11C. Initially, the model's response for the synthetic promoter did not match the experimental data (compare Figures 4.11A and 4.11C).
Figure 4.11: The steady-state average GFP fluorescence across the sixteen concentrations of aTC and IPTG, with the background autofluorescence removed, compared with the mathematical model using ΔG_O1-RNAP = +25 kcal/mol and using ΔG_O1-RNAP = +0.5210 kcal/mol, with all other kinetic and thermodynamic parameters held constant.
Among the parameters listed in Table 4.3 that affect the inducer-dependent probability of the "transcriptionally ready state," P_init, only ΔG_RNAP-DNA, ΔG_O1-RNAP, ΔG_O2-RNAP, and ΔG_O3-RNAP are estimated rather than experimentally measured. Varying ΔG_RNAP-DNA, ΔG_O2-RNAP, and ΔG_O3-RNAP cannot bring the model into agreement with the experimental data. Only by reducing ΔG_O1-RNAP, to 0.52 kcal/mol, does the model closely match the experimental results, indicating that the steric interaction between the Lac repressor and the RNA polymerase is relatively weak, in agreement with the experimental observations.
The L1 norm between the experimental data and the model results is shown in Figure 4.12 as a function of the thermodynamic parameter ΔG_O1-RNAP. The global minimum occurs at ΔG_O1-RNAP = +0.5210 kcal/mol.
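A hedged sketch of this one-parameter fit is shown below: scan ΔG_O1-RNAP over a grid and keep the value that minimizes the L1 norm between the measured and modeled mean fluorescence over the sixteen inducer conditions. The callable model_fluorescence and the data arrays are placeholders standing in for the full steady-state model and the FACS measurements.

```python
import numpy as np

def l1_fit_dG_O1(measured, inducer_grid, model_fluorescence,
                 dG_values=np.linspace(0.0, 5.0, 501)):
    """Grid-search the steric penalty dG_O1-RNAP (kcal/mol) that minimizes the
    L1 norm between measured and modeled steady-state fluorescence.

    measured           : array of mean fluorescence, one entry per inducer condition
    inducer_grid       : list of (aTC, IPTG) concentrations, same order as `measured`
    model_fluorescence : callable(aTC, IPTG, dG_O1) -> predicted fluorescence
    """
    best_dG, best_err = None, np.inf
    for dG in dG_values:
        pred = np.array([model_fluorescence(atc, iptg, dG) for atc, iptg in inducer_grid])
        err = np.abs(pred - measured).sum()      # L1 norm over the 16 conditions
        if err < best_err:
            best_dG, best_err = dG, err
    return best_dG, best_err
```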
Predicting the Behavior of Improved Synthetic Promoters
The synthetic promoter only partially meets its intended design goals, exhibiting a fuzzy AND logical response to the aTC and IPTG inducers, characterized by two distinct levels of gene expression. To redesign the synthetic promoter for a more definitive AND-like response, we use the mathematical model to simulate the effects of rearranging the genetic components within the promoter. We swap the positions of the lacO1 and tetO2 operators while keeping all other parameters constant and predict the outcomes of these modifications.
The proposed synthetic promoters and their annotated sequences are illustrated in Figure 4.13. Because only three overlapping operators are available (one upstream, one in the spacer, and one downstream), we must choose which repressor should be able to inhibit gene expression on its own and where it should bind. Our approach is to position that repressor at the spacer or downstream sites, where it creates sufficient steric hindrance with the RNA polymerase to repress gene expression effectively.
The predicted steady-state average GFP fluorescence of the proposed synthetic promoters in response to the aTC and IPTG inducers is shown in Figure 4.14. Our analysis assumes that operators in the spacer and downstream positions effectively prevent RNA polymerase binding, with ΔG_O2-RNAP = +1.5 kcal/mol and ΔG_O3-RNAP = +1.0 kcal/mol; anecdotal evidence suggests that the middle operator in the spacer represses more efficiently than the downstream operator. We have also assumed equivalent steric interactions between the RNA polymerase and the Lac and Tet repressors, although this may not be accurate. Despite these assumptions, the results may reveal general trends that can improve the design of synthetic promoters.
The first synthetic promoter places the sole lacO1 operator in the upstream position.

[Figure 4.13: Annotated sequences of the proposed synthetic promoters, with the operator positions (upstream, spacer, downstream), the -35 and -10 hexamers, and the RBS indicated:
TCCCTATCAGTGATAGAGA TTGACA TCCCTATCAGTGATAGA GATACT AATTGTGAGCGGATAACAATT AGGAAACCGGTTC ATG
TCCCTATCAGTGATAGAGA TTGACA TTGTGAGCGGATAACAA GATACT TTCCCTATCAGTGATAGAGA AGGAAACCGGTTC ATG
AATTGTGAGCGGATAACAA TTGACA TTGTGAGCGGATAACAA GATACT TTCCCTATCAGTGATAGAGA AGGAAACCGGTTC ATG
TCCCTATCAGTGATAGAGA TTGACA TTGTGAGCGGATAACAA GATACT AATTGTGAGCGGATAACAATT AGGAAACCGGTTC ATG]
The proposed modifications place the tetO2 and lacO1 operators at different positions to alter the gene expression response. The configurations include two tetO2 operators flanking a lacO1 operator, or vice versa, as well as variants in which both the upstream and middle positions contain lacO1 operators with a tetO2 operator downstream. These arrangements aim to create an AND logic gate response, in which gene expression peaks only in the presence of both aTC and IPTG, giving significantly improved expression compared to the original promoter. Alternative configurations show varying levels of gene expression, with one arrangement yielding a lower expression plateau in the absence of IPTG and a higher plateau in the absence of aTC. The positioning of the repressors is crucial: placing the lacO1 operators in the most effective locations while positioning the tetO2 operators strategically improves the fidelity of the AND response, minimizing gene expression when either inducer is absent.
The predicted steady-state average GFP fluorescence of the synthetic promoters, assuming the Gibbs free energy values above, reveals important insights into their design. The first synthetic promoter exhibits the most AND-like logical response, yet shows significant plateaus in gene expression in the absence of aTC or IPTG due to the placement of the tetO2 and lacO1 operators. In contrast, positioning the lacO1 operator centrally while keeping the tetO2 operators in less effective positions leads to a higher gene expression plateau in the absence of aTC. A single tetO2 operator in a moderately efficient position also influences expression levels, while placing one in the least efficient position noticeably raises the low plateau of gene expression in the absence of both inducers.
Using a quantitative model allows alternative synthetic promoters with modified DNA sequences to be evaluated rapidly. The next step is to construct and characterize these synthetic promoters and assess the agreement between the experimental data and the model predictions. Where discrepancies arise, we can identify the parameters responsible and adjust their values. In particular, the estimated parameters ΔG_O2-RNAP and ΔG_O3-RNAP will largely determine the differences between the experimental results and the model predictions, which will allow them to be calculated.
The objective was to develop a synthetic promoter that requires high concentrations of both IPTG and aTC to activate expression of the gfp reporter gene. The initial design featured an overlapping lacO1 operator upstream of the -35 hexamer and two overlapping tetO2 operators, one in the spacer region and one downstream of the -10 hexamer. Slightly non-consensus -35 and -10 hexamers, coupled with an mRNA ribosome binding site that favors a sequestering secondary structure, were chosen to minimize gene expression in the absence of the lac and tet repressors. In response to the inducers, the synthetic promoter exhibits two distinct levels of gene expression. At low IPTG concentrations, rising aTC levels activate gfp expression, though at a sub-maximal rate. When both IPTG and aTC are present, GFP production reaches a peak rate over 20% higher than the first plateau. The synthetic promoter's response to its inducers therefore combines OR-like and AND-like behavior, often termed a fuzzy AND logic gate.
To understand the fuzzy AND logical behavior of the synthetic promoter, we developed a steady-state mathematical model that accounts for the participating molecular interactions and their kinetic and thermodynamic properties. Unlike previous models, our approach treats each DNA operator as a distinct chemical species with its own interactions. We also use the chemical partition function to evaluate the probabilities of the various regulatory states, including rare states in which the Holoenzyme and a repressor are bound simultaneously. Furthermore, the model separates inducer-dependent from inducer-independent terms, so the model's "high/low" levels and its response to the inducers can be compared with the experimental data independently.
The model makes predictions about how rearranging the operator sites of a synthetic promoter changes its activity in response to the chemical inducers. It does not presuppose how effective each repressor is at preventing Holoenzyme assembly, and it avoids arbitrary rate laws fitted to the experimental data. In addition, the model allows unknown thermodynamic parameters to be calculated from the synthetic promoter's response to the inducers, without estimating its minimum and maximum activities. By focusing on the changes in expression rate across inducer concentrations and using experimentally measured parameters, such as the repressor affinities, we reduce the degrees of freedom and determine the remaining unknowns.
Only changes in the steric interaction between the Lac repressor and the RNA polymerase at the upstream operator can replicate the synthetic promoter's experimental data. This allows us to calculate the thermodynamic Gibbs free energy of this interaction with high accuracy, even though it is an indirect measurement. A direct measurement of this positive Gibbs free energy is not feasible, since it would require accurately measuring the work needed to bring the Lac repressor and the RNA polymerase together at chemical equilibrium. Nonetheless, we have shown that an accurate indirect measurement is achievable.
Using the newly calculated thermodynamic parameters, we apply the mathematical model to predict the inducer response of new synthetic promoters. By exchanging the upstream lacO1 and downstream tetO2 operators, we can design a synthetic promoter with a more AND-like logical response. Such mathematical models can first be fit to experimental data to determine missing parameters and then be used to design new synthetic promoters with different behaviors.
Bifurcation analysis of deterministic differential equations reveals why dynamical behaviors change suddenly when system parameters are modified. As we study ever smaller physical, chemical, and biological systems, the influence of thermal and mechanical random motion becomes increasingly significant. These systems are often represented as jump Markov processes, characterized by a time-dependent probability distribution governed by a kinetic Master equation. Stochastic bifurcation analysis examines how parameter changes qualitatively affect the steady-state solutions of the kinetic Master equation and the stability of system trajectories in both forward and reverse time.
In this chapter, we employ forward and reverse time stochastic simulation alongside an iterative sampling procedure to generate the first bifurcation diagram of a stochastic jump Markov process characterized by non-linear transition rates and multiple discrete states. Using the bistable chemical Schlögl model as a case study, we compute both stationary and non-stationary steady-state solutions of the forward and reverse kinetic Master equations as a function of a bifurcation parameter. These bifurcation diagrams serve as essential tools for scientists and engineers analyzing and designing systems affected by continuous thermal or mechanical noise.
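For readers who want to reproduce the forward-time half of such a diagram, the sketch below runs Gillespie's direct method on the Schlögl reactions (A + 2X ⇌ 3X, B ⇌ X) and bins the end states of many independent trajectories. The rate constants are illustrative values that produce bistability, and the chemostatted pool B stands in for the bifurcation parameter.

```python
import random

def schlogl_ssa(x0=250, t_end=5.0, b=2.0e5, seed=0):
    """Gillespie direct-method simulation of the Schlogl model.

    Reactions (with A and B treated as chemostatted, constant pools):
      A + 2X -> 3X,   3X -> A + 2X,   B -> X,   X -> B
    The pool size b plays the role of the bifurcation parameter here.
    """
    rng = random.Random(seed)
    c1, c2, c3, c4 = 3.0e-7, 1.0e-4, 1.0e-3, 3.5   # illustrative rate constants
    a_pool = 1.0e5
    x, t = x0, 0.0
    while t < t_end:
        a = [0.5 * c1 * a_pool * x * (x - 1),      # autocatalytic birth
             (c2 / 6.0) * x * (x - 1) * (x - 2),   # reverse of autocatalysis
             c3 * b,                               # constant influx from B
             c4 * x]                               # first-order removal
        a0 = sum(a)
        if a0 <= 0.0:
            break
        t += rng.expovariate(a0)                   # waiting time to next event
        r = rng.random() * a0                      # choose which reaction fires
        if r < a[0]:
            x += 1
        elif r < a[0] + a[1]:
            x -= 1
        elif r < a[0] + a[1] + a[2]:
            x += 1
        else:
            x -= 1
    return x

# Crude view of the stationary distribution at one parameter value: end states of
# many independent runs, binned into the low basin, the middle, and the high basin.
ends = [schlogl_ssa(seed=s) for s in range(40)]
print([sum(lo <= x < hi for x in ends) for lo, hi in ((0, 150), (150, 400), (400, 1200))])
```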
5.2 Conceptual Background on Random Dynamical Systems
A random dynamical system can be better understood through an example rather than a formal definition. Consider a first-order ordinary differential equation, dx/dt = f(x,t), with a specified initial condition. By numerically integrating this equation over a defined interval, we generate a trajectory from the initial condition to a final state. If we select a different initial condition and perform the integration again over the same time frame, we obtain a distinct trajectory. By choosing a fine grid of initial conditions within the domain and repeating the numerical integration for each, we can observe the varied outcomes produced by the system.
*This assumes existence and uniqueness of the solution
The ordinary differential equation thus captures the time evolution of many inter-related trajectories originating from various initial conditions within a specified domain. This process is abstractly represented by a geometric operator known as the flow. Specifically, a differential equation produces a flow ϕ(t, x) on the domain D, where ϕ(t, x₀) denotes the solution of the differential equation at time t, initialized from the point x₀.
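A minimal sketch of this construction is given below: integrate the same (illustrative) scalar ODE from a grid of initial conditions and collect ϕ(t, x₀) for each, which is exactly the flow evaluated at time t. The example right-hand side dx/dt = x - x³ is an assumption chosen only because it has two attracting points.

```python
import numpy as np
from scipy.integrate import solve_ivp

def flow(t, x0_grid, f):
    """Evaluate the flow phi(t, x0) of dx/dt = f(t, x) for each initial condition."""
    phi = []
    for x0 in x0_grid:
        sol = solve_ivp(f, (0.0, t), [x0], rtol=1e-8, atol=1e-10)
        phi.append(sol.y[0, -1])
    return np.array(phi)

# Illustrative scalar ODE with two attracting points at x = -1 and x = +1
f = lambda t, x: x - x**3
x0_grid = np.linspace(-2.0, 2.0, 9)
print(np.round(flow(5.0, x0_grid, f), 3))
```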