bedtools getfasta extracts sequences from a FASTA file for each of the intervals defined in a BED/GFF/VCF file.
Tip
1. The headers in the input FASTA file must exactly match the chromosome column in the BED file.
2. You can use the UNIX fold command to set the line width of the FASTA output. For example, fold -w 60 will make each line of the FASTA file have at most 60 nucleotides for easy viewing.
3. BED files containing a single region require a newline character at the end of the line, otherwise a blank output file is produced.
Specify an output file name. By default, output goes to stdout.
-name
Use the “name” column in the BED file for the FASTA headers in the output FASTA file.
-tab
Report extracted sequences in a tab-delimited format instead of in FASTA format.
-bedOut
Report extracted sequences in a tab-delimited BED format instead of in FASTA format.
-s
Force strandedness. If the feature occupies the antisense strand, the sequence will be reverse complemented. Default: strand information is ignored.
-split
Given BED12 input, extract and concatenate the sequences from the BED “blocks” (e.g., exons).
Default behavior
bedtools getfasta will extract the sequence defined by the coordinates in a BED interval and create a new FASTA entry in the output file for each extracted sequence. By default, the FASTA header for each extracted sequence will be formatted as follows: “<chrom>:<start>-<end>”.
One can optionally request that each FASTA record be built by extracting and concatenating the blocks of a BED12 record. For example, consider a BED12 record describing a transcript. By default, getfasta will extract the sequence representing the entire transcript (introns, exons, UTRs). Using the -split option, getfasta will instead produce a single FASTA record representing a transcript that splices together each BED12 block (e.g., exons and UTRs in the case of genes described with BED12).
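The difference between the default behavior and -split can be sketched in a few lines of Python. This is a toy illustration, not the bedtools implementation: the function names and the one-chromosome genome are hypothetical, and it assumes standard BED conventions (0-based, half-open intervals; BED12 block starts given as offsets from the interval start).

```python
# Toy illustration only; this is not how bedtools itself is implemented.

def extract(genome, chrom, start, end):
    """Return the bases for a 0-based, half-open BED interval [start, end)."""
    return genome[chrom][start:end]


def default_record(genome, chrom, start, end):
    """One FASTA record per interval, with a '<chrom>:<start>-<end>' header."""
    header = f">{chrom}:{start}-{end}"
    return header, extract(genome, chrom, start, end)


def split_sequence(genome, chrom, start, block_sizes, block_starts):
    """Concatenate BED12 blocks; block starts are offsets from the interval start."""
    pieces = []
    for size, offset in zip(block_sizes, block_starts):
        block_begin = start + offset
        pieces.append(extract(genome, chrom, block_begin, block_begin + size))
    return "".join(pieces)


# Hypothetical genome: two "exons" (upper case) separated by an "intron" (lower case).
genome = {"chr1": "AAAACCCCggggggTTTT"}

header, whole = default_record(genome, "chr1", 4, 18)
spliced = split_sequence(genome, "chr1", 4, block_sizes=[4, 4], block_starts=[0, 10])

print(header)   # header in the default <chrom>:<start>-<end> style
print(whole)    # full interval, intron included
print(spliced)  # blocks only, spliced together
```

The default record keeps the lower-case "intron", while the spliced sequence contains only the two blocks.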
If you perform linear regression analysis, you might need to compare different regression lines to see if their constants and slope coefficients are different. Imagine there is an established relationship between X and Y. Now, suppose you want to determine whether that relationship has changed. Perhaps there is a new context, process, or some other qualitative change, and you want to determine whether that affects the relationship between X and Y.
For example, you might want to assess whether the relationship between the height and weight of football players is significantly different than the same relationship in the general population.
You can graph the regression lines to visually compare the slope coefficients and constants. However, you should also statistically test the differences. Hypothesis testing helps separate the true differences from the random differences caused by sampling error so you can have more confidence in your findings.
In this blog post, I’ll show you how to compare a relationship between different regression models and determine whether the differences are statistically significant. Fortunately, these tests are easy to do using Minitab statistical software.
In the example I’ll use throughout this post, there is an input variable and an output variable for a hypothetical process. We want to compare the relationship between these two variables under two different conditions. Here is the Minitab project file with the data.
Comparing Constants in Regression Analysis
When the constants (or y intercepts) in two different regression equations are different, this indicates that the two regression lines are shifted up or down on the Y axis. In the scatterplot below, you can see that the Output from Condition B is consistently higher than Condition A for any given Input value. We want to determine whether this vertical shift is statistically significant.
To test the difference between the constants, we just need to include a categorical variable that identifies the qualitative attribute of interest in the model. For our example, I have created a variable for the condition (A or B) associated with each observation.
To fit the model in Minitab, I’ll use: Stat > Regression > Regression > Fit Regression Model. I’ll include Output as the response variable, Input as the continuous predictor, and Condition as the categorical predictor.
In the regression analysis output, we’ll first check the coefficients table.
This table shows us that the relationship between Input and Output is statistically significant because the p-value for Input is 0.000.
The coefficient for Condition is 10 and its p-value is significant (0.000). The coefficient tells us that the vertical distance between the two regression lines in the scatterplot is 10 units of Output. The p-value tells us that this difference is statistically significant—you can reject the null hypothesis that the distance between the two constants is zero. You can also see the difference between the two constants in the regression equation table below.
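If you want to see what the Condition coefficient measures outside of Minitab, here is a minimal Python sketch. It fits two parallel lines (one shared slope, one constant per condition) by least squares and reports the gap between the constants. The data are hypothetical and noise-free, chosen so that the gap is 10 units; they are not the data from the project file, and the sketch gives only the point estimate, not the p-value.

```python
def mean(values):
    return sum(values) / len(values)


def fit_parallel_lines(x, y, condition):
    """Least-squares fit of one common slope plus a separate constant per condition."""
    groups = {}
    for xi, yi, ci in zip(x, y, condition):
        groups.setdefault(ci, []).append((xi, yi))

    # Pooled within-condition slope: total Sxy divided by total Sxx.
    sxy = sxx = 0.0
    centers = {}
    for label, points in groups.items():
        x_bar = mean([p[0] for p in points])
        y_bar = mean([p[1] for p in points])
        centers[label] = (x_bar, y_bar)
        sxy += sum((px - x_bar) * (py - y_bar) for px, py in points)
        sxx += sum((px - x_bar) ** 2 for px, _ in points)
    slope = sxy / sxx

    # Each condition's line passes through that condition's mean point.
    constants = {
        label: y_bar - slope * x_bar for label, (x_bar, y_bar) in centers.items()
    }
    return slope, constants


# Hypothetical data: same slope in both conditions, B shifted up by 10.
inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
x = inputs + inputs
condition = ["A"] * 5 + ["B"] * 5
y = [5.0 + 2.0 * v for v in inputs] + [15.0 + 2.0 * v for v in inputs]

slope, constants = fit_parallel_lines(x, y, condition)
condition_coefficient = constants["B"] - constants["A"]
print(slope, constants, condition_coefficient)
```

The difference between the two constants plays the role of the Condition coefficient: it is the vertical distance between the two fitted lines.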
Comparing Coefficients in Regression Analysis
When two slope coefficients are different, a one-unit change in a predictor is associated with different mean changes in the response. In the scatterplot below, it appears that a one-unit increase in Input is associated with a greater increase in Output in Condition B than in Condition A. We can see that the slopes look different, but we want to be sure this difference is statistically significant.
How do you statistically test the difference between regression coefficients? It sounds like it might be complicated, but it is actually very simple. We can even use the same Condition variable that we did for testing the constants.
We need to determine whether the coefficient for Input depends on the Condition. In statistics, when we say that the effect of one variable depends on another variable, that’s an interaction effect. All we need to do is include the interaction term for Input*Condition!
In Minitab, you can specify interaction terms by clicking the Model button in the main regression dialog box. After I fit the regression model with the interaction term, we obtain the following coefficients table:
The table shows us that the interaction term (Input*Condition) is statistically significant (p = 0.000). Consequently, we reject the null hypothesis and conclude that the difference between the two coefficients for Input (below, 1.5359 and 2.0050) does not equal zero. We also see that the main effect of Condition is not significant (p = 0.093), which indicates that the difference between the two constants is not statistically significant.
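Here is a rough Python sketch of what the interaction coefficient represents. With a two-level categorical variable, and assuming 0/1 indicator coding, a model containing Input, Condition, and Input*Condition gives the same fitted lines as fitting each condition separately, so the interaction coefficient equals the difference between the two slopes. The slopes 1.5359 and 2.0050 come from the output above; the constants and the noise-free data are hypothetical, and the sketch does not compute p-values.

```python
def fit_line(x, y):
    """Closed-form simple linear regression; returns (constant, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical noise-free data built from the two slopes reported in the post.
inputs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
output_a = [10.0 + 1.5359 * v for v in inputs]  # constant 10.0 is an assumption
output_b = [10.5 + 2.0050 * v for v in inputs]  # constant 10.5 is an assumption

const_a, slope_a = fit_line(inputs, output_a)
const_b, slope_b = fit_line(inputs, output_b)

interaction = slope_b - slope_a        # what the Input*Condition term estimates
condition_effect = const_b - const_a   # what the Condition main effect estimates
print(round(interaction, 4), round(condition_effect, 4))
```

Once the interaction term is in the model, the Condition main effect describes the gap between the lines at Input = 0 only, because the lines are no longer parallel.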
It is easy to compare and test the differences between the constants and coefficients in regression models by including a categorical variable. These tests are useful when you can see differences between regression models and you want to defend your conclusions with p-values.
John Tukey once said, “The best thing about being a statistician is that you get to play in everyone’s backyard.” I enthusiastically agree!
I frequently enjoy reading and watching science-related material. This invariably raises questions, involving other “backyards,” that I can better understand using statistics. For instance, see my post about the statistical analysis of dolphin sounds.
The latest topic that grabbed my attention was an apparent error in the BBC program Wonders of Life. In the episode “Size Matters,” Professor Brian Cox presents a graph with a linear regression line that illustrates the relationship between the size of mammals and their metabolic rate.
How Does the Size of Mammals Affect Their Lives?
Brian Cox, a theoretical physicist, is a really smart guy and one of my favorite science presenters. So, I was surprised when his interpretation of the linear regression model seemed incorrect. Below is a closer look at the graph he presents, and his claim.
Cox points out the straight line and states, “That implies, gram-for-gram, large animals use less energy than small animals . . . because the slope is less than one.”
For linear regression, the slope being less than 1 is irrelevant. Instead, the fact that it is a straight line indicates that the same relationship applies for both small and large mammals. If you increase mass by 1 unit for a small mammal and for a large mammal, metabolism increases by the same average amount for both sizes. In other words, gram-for-gram, size doesn’t seem to matter!
However, it’s unlikely that Cox would make such a fundamental mistake, so I conducted some research. It turns out that biologists use a log-log plot to model the relationship between the mass of mammals and their basal metabolic rate.
Perhaps Cox actually presented a log-log plot but glossed over the details?
If so, this dramatically changes the interpretation of this graph, because log-log plots transform both axes in order to model curvature while using linear regression. If the slope on a log-log plot of metabolic rate by mass is between 0 and 1, it indicates that the nonlinear effect of mass on metabolic rate lessens as mass increases.
This description fits Cox’s statements about the slope and how mass affects metabolic rate.
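A short Python sketch shows why. A straight line on a log-log plot is a power law on the natural scale: metabolic rate = scale × mass^slope. The scale of 1.0 and slope of 0.75 below are hypothetical values picked only to illustrate a slope between 0 and 1, and the two body masses are made up.

```python
def metabolic_rate(mass, scale=1.0, exponent=0.75):
    """Power law Y = scale * mass**exponent, a straight line on log-log axes."""
    return scale * mass ** exponent


def gain_from_one_more_gram(mass):
    """Extra metabolic rate from adding one gram at a given body mass."""
    return metabolic_rate(mass + 1.0) - metabolic_rate(mass)


small, large = 10.0, 10_000.0  # hypothetical body masses in grams

gain_small = gain_from_one_more_gram(small)
gain_large = gain_from_one_more_gram(large)
per_gram_small = metabolic_rate(small) / small
per_gram_large = metabolic_rate(large) / large

print(f"gain per extra gram: small={gain_small:.4f}  large={gain_large:.4f}")
print(f"rate per gram:       small={per_gram_small:.4f}  large={per_gram_large:.4f}")
```

Both comparisons point the same way: an extra gram adds less metabolic rate to a large animal than to a small one, and the rate per gram is lower for the large animal.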
Let’s test Cox’s hypothesis ourselves! Thanks to the PanTHERIA database*, we can fit the same type of log-log plot using similar data.
Log-Log Plot of Mammal Mass and Basal Metabolic Rate
I’ll use the Fitted Line Plot in Minitab statistical software to fit a regression line to 572 mammals, ranging from the masked shrew (4.2 grams) to the common eland (562,000 grams). You can find the data here.
In Minitab, go to Stat > Regression > Fitted Line Plot. Metabolic rate is our response variable and adult mass is our predictor. Go to Options and check all four boxes under Transformations to produce the log-log plot.
The slope (0.7063) is significant (p = 0.000) and its value is consistent with recently published estimates. Because the slope is between 0 and 1, it confirms Cox’s interpretation. Gram-for-gram, larger animals use less energy than smaller animals. In order to function, a cell in a larger animal requires less energy than a cell in a smaller animal.
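For readers without Minitab, the double-log fit amounts to taking the logarithm of both variables and fitting an ordinary straight line. The Python sketch below does this on synthetic, noise-free data generated from a power law with the slope reported above (0.7063). The scale of 3.0 and the intermediate masses are hypothetical, and this is not the PanTHERIA data, so it demonstrates the mechanics only.

```python
import math


def fit_line(x, y):
    """Closed-form simple linear regression; returns (constant, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


true_exponent = 0.7063  # slope reported in the post
scale = 3.0             # hypothetical value, for illustration only

# Synthetic masses in grams, spanning the shrew-to-eland range from the post.
masses = [4.2, 50.0, 1_000.0, 40_000.0, 562_000.0]
rates = [scale * m ** true_exponent for m in masses]

# Transform both axes, then fit a straight line in log space.
log_mass = [math.log10(m) for m in masses]
log_rate = [math.log10(r) for r in rates]
constant, slope = fit_line(log_mass, log_rate)

# Factor by which the per-gram rate changes when mass is multiplied by 10.
per_gram_factor = 10 ** (slope - 1)

print(f"slope={slope:.4f}  scale={10 ** constant:.4f}  per-gram factor={per_gram_factor:.4f}")
```

A fitted slope below 1 makes the per-gram factor less than 1, which is the gram-for-gram statement in numeric form: a tenfold heavier animal has a lower metabolic rate per gram.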
I’m quite amazed that the R-squared is 94.3%! Usually you only see R-squared values this high for a low-noise physical process. Instead, these data were collected by a variety of researchers in different settings and cover a huge range of mammals that live in completely different habitats.
So Cox presented the correct interpretation after all: for mammals, size matters. However, he presented an oversimplified version of the underlying analysis by not mentioning the double-log transformations. This is television, after all!
There are important implications based on the fact that this model is curved rather than linear. If the increase in metabolic rate remained constant as mass increased (the linear model), we’d have to eat 16,000 calories a day to sustain ourselves. Further, mammals wouldn’t be able to grow larger than a goat due to overheating!
Basal Metabolic Rate and Longevity
Cox also presented the idea that mammals with a slower metabolism live longer than those with a faster metabolism. However, he didn’t present data or graphs to support this contention. Fortunately, we can test this hypothesis as well.
In Minitab, I used Calc > Calculator to divide metabolic rate by grams. This division allows us to make the gram-for-gram comparison. Time for another fitted line plot with a double-log transformation!
The negative slope is significant (p = 0.000) and tells us that as metabolic rate per gram increases, longevity decreases. The R-squared is 45.8%, which is pretty good considering that we’re looking at just one of many factors that can impact maximum lifespan!
However, it is not a linear relationship because this is a log-log plot. Maximum longevity asymptotically approaches a minimum value around 13 months as metabolism increases. The graph below shows the curvilinear relationship using the natural scale.
A one-unit increase in the slower metabolic rates (left side of x-axis) produces much larger drops in longevity than a one-unit increase in the faster metabolic rates (right side of x-axis).
Once again, we’ve shown that size does matter! Larger mammals tend to have a slower metabolism. And animals that have a slower metabolism tend to live longer. That’s fortunate for us because without our slower metabolism, we’d only live about a year!