# Windowed Granger causal inference strategy improves discovery of gene regulatory networks

^{a}Interdisciplinary Biological Sciences, Northwestern University, Evanston, IL 60208;^{b}Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL 60208;^{c}Center for Synthetic Biology, Northwestern University, Evanston, IL 60208;^{d}Chemistry of Life Processes, Northwestern University, Evanston, IL 60208

Edited by Douglas A. Lauffenburger, Massachusetts Institute of Technology, Cambridge, MA, and accepted by Editorial Board Member James J. Collins January 9, 2018 (received for review June 16, 2017)

## Significance

Discovery of gene regulatory networks (GRNs) is crucial for gaining insights into biological processes involved in development or disease. Although time-resolved, high-throughput data are increasingly available, many algorithms do not account for temporal delays underlying regulatory systems—such as protein synthesis and posttranslational modifications—leading to inaccurate network inference. To overcome this challenge, we introduce Sliding Window Inference for Network Generation (SWING), which uniquely accounts for temporal information. We validate SWING in both in silico and in vitro experimental systems, highlighting improved performance in identifying time-delayed edges and illuminating network structure. SWING performance is robust to user-defined parameters, enabling identification of regulatory mechanisms from time-series gene expression data.

## Abstract

Accurate inference of regulatory networks from experimental data facilitates the rapid characterization and understanding of biological systems. High-throughput technologies can provide a wealth of time-series data to better interrogate the complex regulatory dynamics inherent to organisms, but many network inference strategies do not effectively use temporal information. We address this limitation by introducing Sliding Window Inference for Network Generation (SWING), a generalized framework that incorporates multivariate Granger causality to infer network structure from time-series data. SWING moves beyond existing Granger methods by generating windowed models that simultaneously evaluate multiple upstream regulators at several potential time delays. We demonstrate that SWING elucidates network structure with greater accuracy in both in silico and experimentally validated in vitro systems. We estimate the apparent time delays present in each system and demonstrate that SWING infers time-delayed, gene–gene interactions that are distinct from baseline methods. By providing a temporal framework to infer the underlying directed network topology, SWING generates testable hypotheses for gene–gene influences.

Elucidating gene–gene regulation is a fundamental challenge in molecular biology, and high-throughput technologies continue to provide insight about the underlying organization, or topology, of these interactions. Accurate network models representing genes (nodes) and regulatory interactions (edges) infer information from many observed heterogeneous components while minimizing the effects of noise and hidden nodes. Many methods infer gene regulatory networks (GRNs) from expression profiles (1), but each suffers from limitations—assumptions of linearity, univariate comparisons, or computational complexity—and most ignore temporal information in time-series data. Understanding the temporal dynamics of gene/protein expression is critical to elucidating responses involved in cell cycle, circadian rhythms, DNA damage, and development (2–5).

Existing methods to infer GRNs from time-series expression profiles include dynamical models, statistical approaches, and hybrids of the two (1, 6–8). Dynamical systems models of differential equations can forecast future system behaviors and characterize formal properties such as stability (9), but these models are computationally intractable for large GRNs due to extensive and explicit parameterization requirements (10). Statistical inference methods—such as regression schemes, mutual information, decision trees, and Bayesian probability (11–13)—make no explicit mechanistic assumptions and are often more computationally efficient than dynamical models. However, many implementations of the aforementioned algorithms treat time points as independent observations, disregarding time delays associated with transcription, translation, and other processes inherent to gene regulation (14, 15). Hybrid methods—such as SINDy and Jump3—use statistical methods to optimize the search and parameterization of dynamical models, but they remain computationally expensive and rely on accurate specification of basis functions (16, 17).

If the experimental sampling interval is less than or equal to the time delay between a regulator and its downstream target, it is possible to use Granger causality to incorporate intrinsic delays that are often hidden from measurement (18). Current implementations of Granger causal network inference methods are limited: The inference (*i*) is conducted pairwise, prohibiting simultaneous assessment of multiple upstream regulators; (*ii*) has a single user-defined delay, which assumes a uniform delay between all regulators and their targets; or (*iii*) requires each explanatory variable, assessed at multiple delays, to be selected as a group (19–23). Thus, these implementations have limited utility in biological systems with heterogeneous time delays.

To allow for multiple time delays to affect downstream target nodes, we introduced an extensible framework to infer GRNs from time-series data, termed Sliding Window Inference for Network Generation (SWING). SWING embeds existing multivariate methods, both linear and nonlinear, into a Granger causal framework that concurrently considers multiple time delays to infer causal regulators for each node. SWING also uses sliding windows to create many sensitive, but noisy, inference models that are aggregated into a more stable and accurate network. We validated the efficacy of SWING on several in silico time-series datasets and existing in vitro datasets with corresponding gold standard networks. We show that SWING reconstructs networks more accurately than baseline methods and demonstrate that this performance boost is partly attributed to accurately inferring edges that involve an identifiable time delay between upstream regulators and targets. In validation studies analyzing networks derived from *Escherichia coli* and *Saccharomyces cerevisiae*, SWING inferred networks with distinct topologies and can therefore be combined with other methods to improve consensus models. The SWING framework is available for use and can be found on GitHub (https://github.com/bagherilab/SWING).
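As a point of reference, a multivariate Granger-style model with several candidate delays is often written in the generic linear form below. The notation ($x_i$, $\beta_{ij}^{(k)}$, $k_{\min}$, $k_{\max}$, $N$) is illustrative and is not taken from the paper; the exact SWING formulation is given in *SI Appendix*.

$$
x_j(t) = \sum_{i=1}^{N} \sum_{k=k_{\min}}^{k_{\max}} \beta_{ij}^{(k)}\, x_i(t-k) + \varepsilon_j(t)
$$

Under this reading, $x_j(t)$ is the expression of target gene $j$ at time point $t$, and a nonzero (or highly ranked) coefficient $\beta_{ij}^{(k)}$ suggests that regulator $i$ helps predict target $j$ at a delay of $k$ sampling intervals. A nonlinear base method such as RF would replace the linear sum with a learned function of the same lagged inputs.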

## Results

SWING integrates multivariate Granger causality and ensemble learning to infer interactions from gene expression data. First, SWING subdivides time-series data into several temporally spaced windows based on user-specified parameters (Fig. 1*A*). For each window, edges are inferred from the selected window and previous windows, representing interactions with specific delays. This inference results in a ranked list of time-delayed gene–gene interactions for each window (Fig. 1*B*). The ensemble of models is aggregated based on edge rank into a static GRN (Fig. 1*C*). In silico and in vitro validation confirmed notable performance improvements.
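The windowing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SWING implementation: the function names (`sliding_windows`, `lagged_window_pairs`, `design_columns`), the unit step between windows, and the toy data are all hypothetical, and lags are counted in window steps.

```python
def sliding_windows(n_timepoints, window_size, step=1):
    """Return (start, end) index pairs (end exclusive) for every full window."""
    return [(s, s + window_size)
            for s in range(0, n_timepoints - window_size + 1, step)]


def lagged_window_pairs(windows, min_lag, max_lag):
    """Map each response window index to its (lag, explanatory window index) pairs.

    Windows with no earlier window inside [min_lag, max_lag] are skipped.
    """
    pairs = {}
    for r in range(len(windows)):
        lags = [(k, r - k) for k in range(min_lag, max_lag + 1) if r - k >= 0]
        if lags:
            pairs[r] = lags
    return pairs


def design_columns(series, windows, lags):
    """Collect one explanatory column per (gene, lag) for a response window."""
    cols = {}
    for lag, e in lags:
        start, end = windows[e]
        for gene, values in series.items():
            cols[(gene, lag)] = values[start:end]
    return cols


# Toy data: two genes measured at six time points (hypothetical values).
series = {"G1": [0, 1, 2, 3, 4, 5], "G2": [10, 11, 12, 13, 14, 15]}
windows = sliding_windows(n_timepoints=6, window_size=4)
pairs = lagged_window_pairs(windows, min_lag=1, max_lag=2)
cols = design_columns(series, windows, pairs[2])
```

Each response window then gets its own regression (or tree ensemble) with the `(gene, lag)` columns as candidate regulators, and the per-window edge rankings are aggregated afterward.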

### SWING Improves the Inference of in Silico GRNs.

We applied SWING to reconstruct in silico GRNs simulated by GeneNetWeaver (GNW) (24). A total of 20 subnetworks with 10 nodes and nonisomorphic topologies were extracted from *E. coli* and *S. cerevisiae* networks included in GNW to use as gold standards. Networks were inferred from the generated time-series data by using existing multivariate methods as a basis for comparison. We used RandomForest (RF), least absolute shrinkage and selection operator (LASSO), and partial least-squares regression (PLSR) (11, 12, 25), which represent the areas of sparse, nonlinear, and PLS-based regression. We implemented the SWING chassis and compared the performance of each SWING frontline method with its base method: SWING-RF vs. RF, SWING-LASSO vs. LASSO, and SWING-PLSR vs. PLSR.

To capture short-term dynamics consistent with simulated perturbations, we set the window size to approximately half the duration of the time series. The minimum and maximum lags were set to *A* and *SI Appendix*, Table S1) and across all of the 100-node networks (*SI Appendix*, Fig. S1 and Table S1). In particular, RF received the most notable benefit from SWING; SWING-RF outperformed RF in 39 out of 40 in silico networks, and application of SWING-RF resulted in the highest mean AUROC and AUPR for in silico networks among tested methods.

### SWING Infers Distinct Edges in Networks.

No single method performs optimally across all datasets, partially due to biases in predicting different network topologies. For example, *E. coli*-derived networks predominately feature fan-out motifs, which RF infers with greater sensitivity. In contrast, *S. cerevisiae*-derived networks contain more cascade motifs, which are inferred with greater sensitivity by linear methods (14).

To determine if SWING methods provide distinct information from RF, LASSO, and PLSR (R/L/P), we ran principal component analysis (PCA) on ranked edge lists predicted by SWING and the corresponding base methods (Fig. 2*B*). We discarded PC1 because it largely explains the overall performance of each inference method (58% variance explained; *SI Appendix*, Fig. S2). Clustering of results in PC2 and PC3 seemed to explain biases toward specific network motifs (14). Along PC2, edge rankings appeared to separate based on the internal base method (15% variance explained), while along PC3, SWING edge rankings appeared to separate from those of their base methods (5% variance explained). These results suggest that SWING recovered connectivities that were distinct from those recovered from R/L/P.

Given that it is difficult to determine a priori which methods perform optimally in different contexts, deriving a community network is a good strategy for robustly improving predictions (14). We evaluated the performance of SWING-Community, which combines SWING-RF, -LASSO, and -PLSR predictions by calculating the mean rank across all methods for each possible edge. We note that SWING-Community outperformed RF, resulting in a 52% and 8% mean increase in AUPR and AUROC, respectively, suggesting that SWING infers distinct and complementary networks (*SI Appendix*, Fig. S3).
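The mean-rank aggregation behind a community network can be sketched in a few lines. The function name `community_ranking`, the tie-breaking rule, and the toy edge lists are assumptions made for illustration; the sketch also assumes every method ranks every possible edge.

```python
def community_ranking(rankings):
    """Combine ranked edge lists by mean rank.

    rankings: dict mapping method name -> list of edges, best edge first.
    Returns a list of (edge, mean_rank) sorted from best to worst.
    Ties are broken by edge name so the output is deterministic.
    """
    totals = {}
    for ranked_edges in rankings.values():
        for position, edge in enumerate(ranked_edges, start=1):
            totals[edge] = totals.get(edge, 0) + position
    n_methods = len(rankings)
    mean_ranks = {edge: total / n_methods for edge, total in totals.items()}
    return sorted(mean_ranks.items(), key=lambda item: (item[1], item[0]))


# Hypothetical rankings of three candidate edges by three methods.
rankings = {
    "SWING-RF": [("A", "B"), ("B", "C"), ("C", "A")],
    "SWING-LASSO": [("B", "C"), ("A", "B"), ("C", "A")],
    "SWING-PLSR": [("A", "B"), ("C", "A"), ("B", "C")],
}
community = community_ranking(rankings)
```

Averaging ranks rather than raw scores sidesteps the fact that RF importances, LASSO coefficients, and PLSR scores live on different scales.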

### SWING Improves Network Inference by Promoting Time-Delayed Edges.

Endogenous reactions, such as protein translation, posttranslational modifications, translocation, or oligomerization, are often not accounted for in GRNs. However, even if underlying network kinetics are linear (or approximately linear), the resulting dynamics can appear delayed when not all nodes are observed (*SI Appendix*, Fig. S4*A*). Delayed behavior in gene expression and protein translation has been established in several studies (26, 27).

We estimated the apparent time delay of each interaction in a 10-node GNW network by calculating the pairwise peak cross-correlation between time series of all true regulator and target combinations. The majority of true interactions within GNW networks had a time delay between 0 and 150 min (*SI Appendix*, Fig. S4*B*). We observed that SWING was more likely to promote edges with an identifiable delay within the range of user-specified parameters (*SI Appendix*, Fig. S5*A*). Across all in silico networks, SWING-RF promoted 65.8% of true edges with a delay vs. 55.4% of true edges without a delay (*P* = 0.018), and SWING-PLSR promoted 67.0% of true edges with a delay vs. 47.1% of true edges without a delay (*P* = 6.00e-6) (*SI Appendix*, Fig. S5*B*).
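A minimal sketch of peak cross-correlation lag selection is shown below. It is not the authors' code: the function names, the use of absolute correlation (so that repressive edges are not ignored), the toy series, and the 50-min sampling interval are assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def apparent_delay(regulator, target, max_lag):
    """Return (lag, correlation) for the lag with the largest |correlation|.

    A lag of k compares regulator[t] with target[t + k].
    """
    best_lag, best_r = 0, pearson(regulator, target)
    for lag in range(1, max_lag + 1):
        r = pearson(regulator[:len(regulator) - lag], target[lag:])
        if abs(r) > abs(best_r):
            best_lag, best_r = lag, r
    return best_lag, best_r


# Toy example: the target repeats the regulator pulse two time points later.
regulator = [0, 1, 4, 9, 7, 3, 1, 0, 0, 0]
target = [0, 0, 0, 1, 4, 9, 7, 3, 1, 0]
sampling_interval_min = 50  # assumed spacing between time points
lag, r = apparent_delay(regulator, target, max_lag=3)
delay_min = lag * sampling_interval_min
```

Shifting the target by the selected lag before computing the correlation is the same operation used to show that a delayed edge becomes more strongly correlated once the delay is accounted for.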

Many of the promoted edges with an identifiable delay were highly ranked by base methods RF and PLSR. In general, delayed true edges ranked in the first quartile by the base method were likely to be promoted, while those ranked lower were no more likely to be promoted than nondelayed true edges (*SI Appendix*, Fig. S5*B*). While SWING was more likely to promote true edges with a delay, the magnitude of this promotion was not consistent across the different base methods or networks. SWING-RF promoted true edges with an apparent time delay by an average of 7.50 ranks relative to true edges without an apparent time delay (*P* = 4.75e-3) for *S. cerevisiae*-derived networks. In contrast, SWING-PLSR promoted true edges with an apparent delay by an average of 7.78 ranks relative to true edges without an apparent time delay (*P* = 6.89e-5) for *E. coli*-derived networks (*SI Appendix*, Fig. S5*B*). In one example, *S. cerevisiae* network 12, SWING-RF improved the AUROC from 0.539 to 0.872, a 61.7% increase relative to the base method. Compared with RF, the edge ranking for SWING-RF promoted many true edges, and all of the true edges with a delay were promoted by SWING (*SI Appendix*, Fig. S6*A*).

To demonstrate how SWING promoted delayed edges, we highlighted the true edge between gene 2 (G2) and gene 1 (G1) in *S. cerevisiae* network 12. G2 is the only node upstream of G1, and the input data included an experiment where only G2 was perturbed; thus, the delay between G2 stimulation and G1 response was unambiguously isolated (*SI Appendix*, Fig. S7*A*). We estimated the delay between G2 and G1 as two time points, or 100 min. We shifted the G1 time series by two time points to show that the Pearson correlation of the resulting time series notably increases (*SI Appendix*, Fig. S6*B*).

### SWING Infers Apparent Time-Delayed Edges with Greater Sensitivity in the *E. coli* SOS Network.

We applied SWING to an in vitro eight-node *E. coli* GRN that activates with DNA damage (20, 28). The SOS network contains several complex interactions, including multiple cascades and feedback loops generated by a combination of transcriptional activators and repressors. We computed the mean of three replicates for each time point following DNA damage induced by norfloxacin treatment (29).

The sampling strategy for the in vitro SOS data differs from that of the in silico GNW data. Due to fewer time points, we were restricted to assessing interactions with shorter possible time delays. Using *P* = 1.41e-13) and the AUROC from 0.756 to 0.819 (8.3%, *P* = 5.28e-34). To assess promotion of time-delayed edges, we calculated the mean edge ranks across all 50 runs and compared the resulting lists. Although SWING-RF demoted some true edges, it promoted all three edges that exhibited a time delay (Fig. 3*A*). We highlighted the edge between *lexA* and *umuDC* (*SI Appendix*, Fig. S7*B*), which had an estimated lag of 6 min. When the *umuDC* time series was shifted by this amount, the correlation between *lexA* and *umuDC* increased from 0.709 to 0.928 (Fig. 3*B*). These findings reaffirmed that SWING improves network inference, in part, by promoting edges with identifiable delays.

### SWING Accurately Infers RegulonDB Modules with Time-Delayed Edges.

We curated microarray data to infer time-delayed edges from experimentally validated GRNs in *E. coli* (Fig. 4*A*) and *S. cerevisiae* (*SI Appendix*, Fig. S8). These curated data were aggregated across 18 datasets for *E. coli* and 8 datasets for *S. cerevisiae*, where data were unevenly sampled for time intervals that ranged from 5 to 120 min (*SI Appendix*, Table S2). To assess the landscape of apparent time delays present in these gene expression data, we performed pairwise cross-correlation lag selection between experimentally confirmed edges (31). We reveal that of 2,870 experimentally confirmed edges, only 23.7% exhibited an apparent time delay of 0, and 13.7% exhibited a time delay of at least 10 min. Surprisingly, only 37.4% of confirmed edges exhibited pairwise correlation (R > 0.7, *P* < 1e-5; Fig. 4*A*).

To determine whether lag is associated with modularity and function, we clustered the *E. coli* and *S. cerevisiae* networks into smaller modules using MCODE (32) and performed gene ontology enrichment analysis. Several modules, such as those associated with catabolic processes and metal ion binding, were enriched with time-delayed edges of at least 10 min (*SI Appendix*, Tables S3 and S4). Transcription factors known to regulate genes on a global or combinatorial scale tend to exhibit similar time delays (*SI Appendix*, Table S5).

To determine if SWING more accurately infers network structure in diverse contexts, we performed cubic spline interpolation to generate evenly sampled time-series gene expression data at 10-min intervals and benchmarked SWING-Community performance against an ensemble model of the R/L/P base methods for each clustered module using this dataset. SWING-Community outperformed R/L/P in subnetworks in which >10% of edges were time-delayed (*n* = 26 clusters; 9 clusters with <10 genes or <3 transcription factors were removed from analysis; *P* = 0.031; Fig. 4*B*). As an example, we identified time-delayed properties of key regulators of the *tdcABC E. coli* operon that are responsible for the transport of threonine and serine during anaerobic growth (33). In particular, our analysis identified two global transcription factors that bind combinatorially to induce activity in the *tdcABC* operon. *Crp* and *fnr* are global regulators that respond to glucose starvation and anaerobic growth, respectively (34, 35).
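The resampling step can be illustrated with a small, dependency-free cubic spline. This sketch assumes natural boundary conditions (zero second derivative at both ends), which the text does not specify, and the time points and expression values are hypothetical; in practice a library routine would normally be used instead.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a function S(x) interpolating (xs, ys) with a natural cubic spline."""
    n = len(xs) - 1                                  # number of intervals
    h = [xs[i + 1] - xs[i] for i in range(n)]        # interval widths
    m = [0.0] * (n + 1)                              # second derivatives; ends stay 0
    if n >= 2:
        # Tridiagonal system for the interior second derivatives.
        diag = [2.0 * (h[k] + h[k + 1]) for k in range(n - 1)]
        rhs = [6.0 * ((ys[k + 2] - ys[k + 1]) / h[k + 1]
                      - (ys[k + 1] - ys[k]) / h[k]) for k in range(n - 1)]
        for k in range(1, n - 1):                    # forward elimination
            w = h[k] / diag[k - 1]
            diag[k] -= w * h[k]
            rhs[k] -= w * rhs[k - 1]
        for k in range(n - 2, -1, -1):               # back substitution
            m[k + 1] = (rhs[k] - h[k + 1] * m[k + 2]) / diag[k]

    def spline(x):
        # Locate the interval containing x, clamped to the first/last interval.
        i = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        a, b = xs[i + 1] - x, x - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return spline


# Hypothetical unevenly sampled expression profile (minutes, arbitrary units).
times = [0, 5, 15, 30, 60, 120]
expression = [1.0, 1.4, 2.6, 3.1, 2.2, 1.5]
spline = natural_cubic_spline(times, expression)
grid = list(range(0, 121, 10))                       # evenly spaced 10-min grid
resampled = [spline(t) for t in grid]
```

Interpolated values between widely spaced samples (for example between 60 and 120 min here) are smooth guesses rather than measurements, so apparent delays shorter than the original sampling interval should be interpreted with care.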

Interestingly, lag analysis identified 10- and 20-min time delays between *crp* and target genes in the *E. coli tdcABC* operon. While the precise delay identified by our analysis was not consistent with that observed in experiments, studies confirmed that a delay existed between *crp* induction and the induction of several target genes (36). This delay can possibly be attributed to posttranslational modification of *crp* (37). Of 32 edges in the gold standard, SWING identified 27 true-positive (TP) edges and 5 false-positive (FP) edges (85% TP), while the ensemble model predicted 24 TP edges and 8 FP edges (75% TP). In this example, SWING-Community inferred both time-delayed and non-time-delayed edges more sensitively than the R/L/P ensemble model. The FP edges inferred by SWING-Community were also within the subset of FP edges inferred by the base community method.

### SWING Performance Is Robust Across Parameters.

SWING adds user-defined parameters to baseline methods, which are necessary for window creation and time-delay inference. The selection of these parameters was both context- and data-specific. We conducted parametric sensitivity analysis of SWING as a function of window size, combinations of *E. coli* SOS network (*SI Appendix*, Figs. S9–S14). While SWING outperformed baseline methods over a wide range of window sizes (*SI Appendix*, Fig. S9), the performance of a single network may differ from other networks, suggesting that the optimal window size is partially dependent on the underlying inference method and network structure. Therefore, user-specified SWING parameters—*SI Appendix*, *Sensitivity Analysis*. Overall SWING outperforms baseline methods for a wide range of possible parameters (*SI Appendix*, Figs. S9–S13).

## Discussion

Tight regulation of gene expression is critical to maintaining robust responses to perturbations and environmental disturbances, and misregulation of intracellular signaling dynamics can lead to a wide variety of diseases. For this reason, uncovering the topology of GRNs is of fundamental interest to the scientific community, since the resulting maps can be used to identify interventions to control cellular phenotypes. Many current methods disregard temporal information and are limited in their ability to accurately infer network topology. Indifference to time delays will be the Achilles heel of many systems biology strategies. We developed a general temporal framework for network inference that accurately uncovers the regulatory structures governing complex biological systems by accounting for these fundamental delays. SWING improves upon existing Granger methods by generating an ensemble of windowed models that simultaneously evaluate multiple upstream regulators at several potential time delays. We validated its utility and performance in several in silico (Fig. 2*A*) and in vitro (Figs. 3 and 4*B*) systems.

### Consideration of Time Delays Improves SWING Performance and Should Be Integrated in Experimental Design.

Our in silico and in vitro results demonstrate that promoted edges were enriched for those with apparent time delays (*SI Appendix*, Fig. S5*B*), suggesting that network inference was improved, in part, by accounting for temporal information. We supported this finding by demonstrating that SWING-RF promotes an edge with a distinct and singular delay (*SI Appendix*, Fig. S6*A*). We also used SWING to predict directed edges of several *E. coli* subnetworks using cubic spline interpolated microarray datasets. Through cross-correlation analysis, we estimated time-delayed interactions in in silico, *E. coli*, and *S. cerevisiae* networks, and showed that SWING performed better than baseline methods in modules with more frequent time-delayed edges, such as the *tdcABC* regulon.

Interestingly, the apparent time delay only partially explained improved performance, as SWING also promoted edges without apparent time delays in in silico and in vitro networks. This discrepancy may have arisen from our conservative approach for identifying time delays; a more liberal approach could assign time delays to a greater fraction of the promoted edges. However, it is particularly challenging to estimate time delays for genes with multiple regulators by using cross-correlation. More complex algorithms that incorporate additional information (i.e., nonlinearity and partial correlation) could improve time-delay estimation between regulators and targets (38).

An additional consideration involves interactions that occur faster than the sampling interval. These interactions will not exhibit a delay in the time series and will resist inference and estimation of time delay regardless of methodology. This bottleneck can be managed by designing experiments with shorter sampling intervals. The choice of sampling interval is context-specific, and we recommend sampling with sufficient frequency to capture dynamics of interest.

### SWING Outperforms Common Network Inference Algorithms Across Scales.

SWING outperforms common network inference algorithms—R/L/P—but is limited by computational expense. Since SWING constructs a larger explanatory matrix and executes multivariate comparisons between multiple time delays, it is more expensive than the aforementioned methods. Fortunately, SWING is trivially parallelizable and can be implemented on any multicore processing system. We performed inference on similarly derived 100-node in silico networks and found that SWING increased the AUPR and AUROC for all three methods (*SI Appendix*, Fig. S1), including SWING-LASSO, which had no significant difference for the 10-node networks (Fig. 2*A*). Remarkably, every single network was inferred with greater accuracy, indicating that SWING has notable benefits for larger inference tasks (*SI Appendix*, Fig. S1 and Table S1).

### SWING Is an Extensible Framework.

Compared with other time-delayed inference algorithms, SWING is a flexible and extensible framework that is not limited to using a single statistical method. The SWING framework was implemented with R/L/P; it can be easily expanded to use other multivariate inference algorithms, including those that use prior information and heterogeneous data types (39). Additional improvements can be made by incorporating complex weighting of methods for consensus analysis that leverage known weaknesses and biases of inference methods. Methods that involve empirical optimization of combination weights, such as those assessed in the DREAM challenge, are expected to substantially improve SWING performance (40).

Although we implemented SWING to infer interactions from gene expression data, the same Granger causality principles can be applied to a wide variety of contexts with temporal dynamics. Provided sufficient time-series data, we expect SWING to identify regulatory relationships in related intracellular signaling pathways, as well as broader fields such as ecology, social sciences, and economics. As the sensitivity/specificity of experimental tools increases and the cost of implementation decreases, we expect longer and higher-resolution time-series data to become widely available. We expect this increase in time resolution to further improve the accuracy of SWING-based network inference, especially as the community continues to build on the SWING chassis. The SWING framework, with currently implemented methods, is available on GitHub (https://github.com/bagherilab/SWING).

## Materials and Methods

The SWING algorithm is described in detail in *SI Appendix*, *SI Materials and Methods*, including parameter selection, management of time-series data and window creation, model aggregation, and graph generation. In silico simulations and in vitro data aggregation are also described in *SI Appendix*, *SI Materials and Methods*. The sensitivity of SWING performance as a function of user-defined parameters is described in *SI Appendix*, *SI Sensitivity Analysis*.

## Acknowledgments

This research was supported, in part, by Biotechnology Training Program Grant T32 GM008449 (to J.D.F.), NIH National Heart, Lung, and Blood Institute Award F31HL134331-02 (to J.J.W.), NSF CAREER Award CBET-1653315 (to N.B.), the Quest high performance computing facility, and the McCormick School of Engineering at Northwestern University.

## Footnotes

^{1}J.D.F. and J.J.W. contributed equally to this work.

^{2}To whom correspondence should be addressed. Email: n-bagheri{at}northwestern.edu.

Author contributions: J.D.F., J.J.W., and N.B. designed research; J.D.F. and J.J.W. performed research; J.D.F. and J.J.W. analyzed data; and J.D.F., J.J.W., and N.B. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. D.A.L. is a guest editor invited by the Editorial Board.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1710936115/-/DCSupplemental.

Published under the PNAS license.

## References

- (3) Spellman PT, et al. [...] *Saccharomyces cerevisiae* by microarray hybridization. Mol Biol Cell 9:3273–3297.
- (4) Geva-Zatorsky N, et al.
- (6) Madar A, Greenfield A, Ostrer H, Vanden-Eijnden E, Bonneau R.
- (9) Zak DE, Gonye GE, Schwaber JS, Doyle FJ.
- (11) Tibshirani R.
- (13) Lawrence ND, Sanguinetti G, Rattray M.
- (16) Brunton SL, Proctor JL, Kutz JN.
- (25) Ciaccio MF, Chen VC, Jones RB, Bagheri N.
- (27) McAdams HH, Arkin A.
- (28) Ronen M, Rosenberg R, Shraiman BI, Alon U.
- (35) Crack J, Green J, Thomson AJ.
- (36) Kao KC, Tran LM, Liao JC. [...] *Escherichia coli* revealed by transcriptome network analysis. J Biol Chem 280:36079–36087.
- (38) Runge J, et al.

## Article Classifications

- Biological Sciences
- Systems Biology

- Physical Sciences
- Biophysics and Computational Biology