Labfans是一个针对大学生、工程师和科研工作者的技术社区。 论坛首页 | 联系我们(Contact Us)
MATLAB爱好者论坛-LabFans.com
返回   MATLAB爱好者论坛-LabFans.com > 其它 > 资料存档 > MATLAB技术文章
MATLAB技术文章 MATLAB Technical Articles From Mathworks
回复
 
主题工具 显示模式
旧 2008-01-06, 16:32   #1
TechnicalArticles
游客
 
帖子: n/a
默认 Using Genetic Algorithms to Select a Subset of Predictive Variables from a High-Dimen

Using Genetic Algorithms to Select a Subset of Predictive Variables from a High-Dimensional Microarray Dataset
by Sam Roberts
Because they enable the parallel measurement of large numbers of genes (as many as 25,000), microarrays are used extensively for basic and pharmaceutical research in a variety of applications, including drug discovery, toxicology, and oncology. The exploding amount of data generated by microarrays and other research methods, however, poses problems for experimental design, visualization, and statistical analysis and interpretation.
What is a Microarray?
A microarray consists of small lengths of DNA attached to a surface, typically glass or coated quartz. The DNA can either be spotted onto the surface using robotic technology or chemically synthesized in situ. Nucleic acids from an experimental sample are labeled chemically and hybridized to the DNA on the microarray. The microarray is then scanned to produce an image like Figure 1.
We can display these images using commands from the Bioinformatics Toolbox.
ma=affyread('0029_1209_H95Av2_KF0077.CEL');
maimage(ma);
Figure 1. Microarray image. The brightness of each spot is a measure of the level of expression of the gene it represents. A high level of expression indicates that the gene was active in the experimental sample. Click on image to see enlarged view.
Feature Selection

Many recent articles in the microarray literature have discussed the application of feature selection methods to high-dimensional datasets. Researchers can use these methods to select smaller subsets of interesting genes, aiding the interpretation of statistical models while retaining the highest possible degree of the accuracy of models developed on the full dataset. Sometimes, by allowing statistical learning algorithms to focus only on highly predictive genes while ignoring redundant variables and irrelevant noise, feature selection methods can even improve the accuracy of statistical models.
Feature selection methods are often separated into two groups: filter and wrapper methods. Filter methods generally rank each gene individually by some quality criterion (for example, the p-value of t-test comparing two populations of interest with regard to the expression levels of the gene in the populations), and then select the subset of genes with the n highest quality criteria. Wrapper methods use a search algorithm to evaluate subsets of the variables as a group, rather than individually. An exhaustive search through all subsets is clearly not possible—there could be around 225,000 variable subsets to consider. Therefore, these search algorithms use heuristics to direct their search towards promising candidates. One such search method is the genetic algorithm.
What is a Genetic Algorithm?

The name “genetic algorithm” is not connected to the fact that we are analyzing gene expression data. A genetic algorithm is a general-purpose method for searching through a large number of objects to find one that optimizes a quantity of interest, using methods derived from Darwinian evolution. The algorithm randomly selects an initial population of objects and their “fitness” is evaluated by calculating the quantity of interest. The best ones are selected to become “parents” of a “child” population, by exchanging properties (“crossover”), and occasionally mutating to create new properties. This process is iteratively repeated, converging on a population that optimizes the quantity of interest. In the analysis described in this article, the objects are subsets of genes, and the quantity to be optimized is how well the subset can predict lymph node status.
This article discusses the use of the Genetic Algorithm and Direct Search Toolbox to perform variable selection on a microarray dataset. Although our focus is on the analysis of microarray data, variable selection using genetic algorithms is a method applicable to many areas involving high-dimensional datasets: spectroscopy, quantitative structure-activity relationships, and many others.
Data

We will be working with the preprocessed microarray dataset. The original data is publicly available and described in “Gene Expression Predictors of Breast Cancer Outcomes” (Huang et al (2003) [1]). The authors obtained microarray data from samples of primary breast tumors from 89 patients, as well as information concerning each patient's clinical state (lymph node status).
Invasion of a tumor into axillary lymph nodes is a significant prognostic factor for breast cancer. Currently, the best method of classifying patients into subgroups of severity is the investigation by a pathologist of lymph node biopsies. However, this method is invasive, and also imperfectly predictive (22–33% of patients classified as low-risk go on to develop breast cancer).
The collection of gene expression data may add predictive value to the current clinical indicators, as it can provide new information that is believed to be relevant in sub-classifying tumors.
The authors originally used cluster analysis and principal component analysis to construct a small number of “metagenes” with which they developed a predictive model of patient status. In this article, we demonstrate the use of genetic algorithms to select a subset of the genes that is highly predictive of lymph node status.
Importing the Dataset

We downloaded our raw data directly from the authors' Website.
load HuangData
numSamples = size(signal,1);
numVariables = size(signal,2);
whos Name Size Bytes Class call 89x12625 1123625 logical array descriptions
1x12625 4049096 cell array geneNames 1x12625 965230 cell array lnNodeStatus 89x1 712 double array numSamples 1x1 8 double array numVariables
1x1 8 double array sampleNames 89x1 9918 cell array signal 89x12625 8989000 double array
Grand total is 4024632 elements using 15137597 bytes For ease of exposition, we preprocessed the data a little. In particular, the lymph node status has been re-ordered so as to match the order of the samples, and the microarray signal, absent/present call (see below), gene names, and gene descriptions have been extracted from the raw dataset into their own separate variables.
Dimensionality Reduction

We first remove all genes that are detected imperfectly or only at very low levels. The microarrays used in this study, which are from Affymetrix, give rise to an absent/present call for each gene (an indicator of whether the gene has been detected at significant levels). We include only genes that have been detected as present.
for i=1:numVariables calledPresent(i) = all(call(:,i)); end
geneNames = geneNames(calledPresent);
descriptions = descriptions(calledPresent);
signal = signal(:,calledPresent);
call = call(:,calledPresent);
numVariables = size(signal,2);
whos Name Size Bytes Class call 89x1447 128783 logical array calledPresent
1x12625 12625 logical array descriptions 1x1447 468102 cell array geneNames 1x1447 110798 cell array
lnNodeStatus 89x1 712 double array numSamples
1x1 8 double array numVariables 1x1 8 double array sampleNames 89x1 9918 cell array signal 89x1447 1030264 double array
Grand total is 478185 elements using 1761226 bytes Setting up the Genetic Algorithm

The default options for genetic algorithms in the Genetic Algorithm and Direct Search toolbox are most suitable for searching though continuous spaces. In this application, we are searching through a discrete space (subsets of variables), so we need to customize them. First, we write a custom fitness function.
type varselect_gafit.m
function errorRate = varselect_gafit(variableIndices,...
data,groups,numSamples)

% Create ten-fold cross-validation indices
crossValidationIndices = mod(1:numSamples,10);
% Initialise a count of the errors so far
errorsSoFar = 0;
% Repeat for each of the ten cross-validations
for i = 0:9
% Create a test and training set
testIndices = (crossValidationIndices == i);
trainIndices = ~testIndices;
% Generate a classifier to discriminate groups in the data
classes = classify(...
data(testIndices,variableIndices),...
data(trainIndices,variableIndices),...
groups(trainIndices,:));
% Find classification errors and update the errors so far
errors = sum(classes ~= groups(testIndices,:));
errorsSoFar = errorsSoFar+errors;
end
% Extract the final error rate
errorRate = errorsSoFar/numSamples;
This function uses the classify tool from the Statistics Toolbox to discriminate between two groups (lymph node status negative and positive), using the subset of variables currently being evaluated. The error rate of the classification model generated is estimated using 10-fold cross-validation. The estimated error rate is the fitness function minimized by the genetic algorithm.
We also need to customize the function used to generate an initial population for the genetic algorithm.
type varselect_gacreate.m
function population = varselect_gacreate(numSelectedVariables,FitnessFcn,...
options,numVariables)

% Repeat for each member of the population
for i = 1:options.PopulationSize
% Initialise a flag to indicate that all selected variables are
% distinct from each other
allVariablesUnique = 0;
% Loop continuously until all selected variables are unique
while ~allVariablesUnique
% Randomly select a subset of variable indices
variables = floor(rand(numSelectedVariables,1)*numVariables)+1;
% Check that all variables are different. If so, set the flag
% to break out of the loop, otherwise loop again
if size(variables) == size(unique(variables))
allVariablesUnique = 1;
end
end
% Add the current variable subset to the population
population(i,:) = variables;
end This function randomly selects a subset of variables from the dataset. If the variables are all different, the subset is included in the initial population for the genetic algorithm. If not, the function generates another random subset until an initial population of the desired size has been created.
Lastly, we need to make small changes to the default crossover and mutation functions to ensure that they always return subsets of variables that do not contain repeated variables. To access these functions, enter the following commands:
type varselect_crossoverscattered.m
type varselect_mutationuniform.m
Setting Algorithm Options

We can set various options regarding the performance of the genetic algorithm, for example, the population size, the proportion of each new generation that is generated by crossover and by mutation, the criteria used to halt the algorithm (in terms of number of generations or time), and the command-line and graphical outputs of the algorithm.
options = gaoptimset('CreationFcn',{@varselect_gacreate,numVariables},...
'CrossoverFcn',@varselect_crossoverscattered,...
'MutationFcn',@varselect_mutationuniform,...
'PopulationSize',100,...
'PopInitRange',[1;numVariables],...
'FitnessLimit', 0,...
'StallGenLimit',100,...
'StallTimeLimit',600,...
'TimeLimit',600,...
'Generations',100,...
'CrossoverFraction',0.5,...
'Display','iter',...
'PlotFcn',@gaplotbestf); Running the Genetic Algorithm

We can now run the genetic algorithm to search for the subset of genes that best predicts patient lymph node status.
numSelectedVars = 10;
FitnessFcn = {@varselect_gafit,signal,lnNodeStatus,numSamples};

[selectedVars, errorRate,reason,output] = ga(FitnessFcn, ...
numSelectedVars,options);
selectedVars = Columns 1 through 5

1149 868 929 920 1170

Columns 6 through 10

792 1050 556 680 458
errorRate =
0.0225 The algorithm takes approximately 1,000 seconds to complete on a Pentium M processor with 1GB of RAM.
Figure 2 shows the graphical output of the algorithm. Note that, as the genetic algorithm involves a random component, the results will not always exactly match this output.
Figure 2. The minimum and mean error rate of variable subsets in the current population, by generation. Click on image to see enlarged view.
Evaluating and Interpreting the Results

The selected subset of 10 genes predicts the lymph node status of patients with only a 2.25% error rate. However, this is probably an underestimate of the true error rate, as the subset of variables has been selected by searching through thousands of possible subsets, and it is likely that some will prove accurate predictors just by chance. To provide a more accurate estimate of the likely error rate of the model on unseen cases, we would ideally want to validate the predictions on a test set of unseen data.
The genes also require biological validation to confirm that they are possible candidates for biomarkers of lymph node status. We can display the descriptions of the selected genes to get some idea of their function.
selectedDescriptions = descriptions(selectedVars)';
for i=1:10
disp(selectedDescriptions{i}(15:80)) end
AL109682:Homo sapiens mRNA full length insert cDNA clone EUROIMAGE
J02940:Human platelet glycoprotein Ib alpha chain mRNA, complete c
AF052145:Homo sapiens clone 24400 mRNA sequence /cds=UNKNOWN /gb=A
AF007833:Homo sapiens kruppel-related zinc finger protein hcKrox m
H55746:140A6 Homo sapiens cDNA /gb=H55746 /gi=1004390 /ug=Hs.28704
X97249:H.sapiens mRNA for leucine-rich primary response protein 1
W29040:55d6 Homo sapiens cDNA /gb=W29040 /gi=1308997 /ug=Hs.199320
AF054177:Homo sapiens chromodomain-helicase-DNA-binding protein mR
AA079018:zm94e12.s1 Homo sapiens cDNA, 3 end /clone=IMAGE-545614
J04755:Human ferritin H processed pseudogene, complete cds /cds=UN To investigate further, we can submit the selected genes as a batch query to NetAffx, a publicly accessible database made available by Affymetrix that contains details of the genes on the microarrays. (Access to this Website requires a free registration with Affymetrix.) We create a batch query file, suitable for submission to the NetAffx Website, containing the names of the genes.
selectedGeneNames=geneNames(selectedVars)';
fid = fopen('query.txt','w');
for i = 1:numSelectedVars fprintf(fid, '%s\n',selectedGeneNames{i,:}); end
fclose(fid);
type('query.txt');
34538_at
33063_at
33601_at
33592_at
34559_at
32987_at
34081_at
31914_at
32396_f_at
31697_s_at We submit the file query.txt directly to the NetAffx Website as a batch query. Figure 3 shows the query results.
Figure 3. Results of a batch query to NetAffx on the selected genes. Click on image to see enlarged view.
We can also obtain more details on each gene by submitting individual queries on them to NetAffx (you must be logged in to NetAffx for this to work). Figure 4 shows a sample result from a query.
for i = 1:numSelectedVars query=sprintf('https://www.affymetrix.com
/analysis/netaffx/fullrecord.affx?pk=HG-U95AV2%%3A%s',selectedGeneNames{i,:});
web(query, '-new') end Figure 4. Example details of a selected gene. Click on image to see enlarged view.
Many areas of science now generate huge volumes of data that present visualization, modeling, and interpretation challenges. Methods for selecting small but highly relevant and predictive data subsets are therefore receiving much attention. In this article, we demonstrated the use of genetic algorithms, from the Genetic Algorithms and Direct Search Toolbox, coupled with predictive modeling methods from the Statistics Toolbox, to select potential biomarkers of breast tumor severity from a microarray dataset.

更多...
  回复时引用此帖
回复


发帖规则
不可以发表新主题
不可以发表回复
不可以上传附件
不可以编辑自己的帖子

启用 BB 代码
论坛启用 表情符号
论坛启用 [IMG] 代码
论坛启用 HTML 代码



所有时间均为北京时间。现在的时间是 19:48


Powered by vBulletin
版权所有 ©2000 - 2025,Jelsoft Enterprises Ltd.