
Monday, September 21, 2015

Decision Tree Using R -- Implementing Decision Trees in R

Tree-Based Models

Recursive partitioning is a fundamental tool in data mining. It helps us explore the structure of a set of data while developing easy-to-visualize decision rules for predicting a categorical (classification tree) or continuous (regression tree) outcome. This section briefly describes CART modeling, conditional inference trees, and random forests.

CART Modeling via rpart

Classification and regression trees (as described by Breiman, Friedman, Olshen, and Stone) can be generated through the rpart package. Detailed information on rpart is available in An Introduction to Recursive Partitioning Using the RPART Routines. The general steps are provided below, followed by two examples.

1. Grow the Tree

To grow a tree, use
rpart(formula, data=, method=, control=) where

formula   is in the format outcome ~ predictor1 + predictor2 + predictor3 + etc.
data=     specifies the data frame
method=   "class" for a classification tree, "anova" for a regression tree
control=  optional parameters for controlling tree growth. For example, control=rpart.control(minsplit=30, cp=0.001) requires that the minimum number of observations in a node be 30 before attempting a split and that a split must decrease the overall lack of fit by a factor of 0.001 (the cost-complexity factor) before being attempted. A minimal call using these options is sketched below.
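For instance, here is a minimal sketch of a complete call, using the kyphosis data frame that ships with rpart (the same data as the classification example further down); the minsplit and cp values are purely illustrative, not recommendations:

# grow a classification tree with explicit growth controls
library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start,
   data=kyphosis, method="class",
   control=rpart.control(minsplit=30, cp=0.001))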

2. Examine the Results

The following functions help us to examine the results.
printcp(fit)      display the cp table
plotcp(fit)       plot cross-validation results
rsq.rpart(fit)    plot approximate R-squared and relative error for different splits (2 plots); labels are only appropriate for the "anova" method
print(fit)        print results
summary(fit)      detailed results including surrogate splits
plot(fit)         plot the decision tree
text(fit)         label the decision tree plot
post(fit, file=)  create a PostScript plot of the decision tree
In trees created by rpart( ), move to the LEFT branch when the stated condition is true (see the graphs below).

3. Prune the Tree

Prune back the tree to avoid overfitting the data. Typically, you will want to select a tree size that minimizes the cross-validated error, the xerror column printed by printcp( ).
Prune the tree to the desired size using
prune(fit, cp= )
Specifically, use printcp( ) to examine the cross-validated error results, select the complexity parameter associated with minimum error, and place it into the prune( ) function. Alternatively, you can use the code fragment
     fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"]
to automatically select the complexity parameter associated with the smallest cross-validated error. Thanks to HSAUR for this idea.
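Putting these steps together (assuming fit is the rpart object grown in step 1):

# pick the cp value with the smallest cross-validated error, then prune
best.cp <- fit$cptable[which.min(fit$cptable[,"xerror"]), "CP"]
pfit <- prune(fit, cp=best.cp)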


Classification Tree example

Let's use the data frame kyphosis to predict a type of deformation (kyphosis) after surgery, from age in months (Age), number of vertebrae involved (Number), and the highest vertebrae operated on (Start).
# Classification Tree with rpart
library(rpart)

# grow tree
fit <- rpart(Kyphosis ~ Age + Number + Start,
   method="class", data=kyphosis)

printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits

# plot tree
plot(fit, uniform=TRUE,
   main="Classification Tree for Kyphosis")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

# create attractive postscript plot of tree
post(fit, file = "c:/tree.ps",
   title = "Classification Tree for Kyphosis")

[Figures: cp plot; classification tree; classification tree in PostScript]
# prune the tree
pfit <- prune(fit, cp=fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])

# plot the pruned tree
plot(pfit, uniform=TRUE,
   main="Pruned Classification Tree for Kyphosis")
text(pfit, use.n=TRUE, all=TRUE, cex=.8)
post(pfit, file = "c:/ptree.ps",
   title = "Pruned Classification Tree for Kyphosis")

[Figures: pruned classification tree; pruned classification tree in PostScript]

Regression Tree example

In this example we will predict car mileage from price, country, reliability, and car type. The data frame is cu.summary.
# Regression Tree Example
library(rpart)

# grow tree
fit <- rpart(Mileage~Price + Country + Reliability + Type,
   method="anova", data=cu.summary)

printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits

# create additional plots
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(fit) # plot approximate R-squared and relative error

# plot tree
plot(fit, uniform=TRUE,
   main="Regression Tree for Mileage ")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

# create attractive PostScript plot of tree
post(fit, file = "c:/tree2.ps",
   title = "Regression Tree for Mileage ")

[Figures: cp plot; R-squared and relative error plots; regression tree; regression tree in PostScript]
# prune the tree
pfit <- prune(fit, cp=0.01160389) # from the cptable

# plot the pruned tree
plot(pfit, uniform=TRUE,
   main="Pruned Regression Tree for Mileage")
text(pfit, use.n=TRUE, all=TRUE, cex=.8)
post(pfit, file = "c:/ptree2.ps",
   title = "Pruned Regression Tree for Mileage")

It turns out that this produces the same tree as the original.

Conditional inference trees via party

The party package provides nonparametric regression trees for nominal, ordinal, numeric, censored, and multivariate responses. The vignette party: A Laboratory for Recursive Partitioning provides details.
You can create a regression or classification tree via the function
ctree(formula, data=)
The type of tree created will depend on the outcome variable (nominal factor, ordered factor, numeric, etc.). Tree growth is based on statistical stopping rules, so pruning should not be required.
The previous two examples are re-analyzed below.
# Conditional Inference Tree for Kyphosis
library(party)
fit <- ctree(Kyphosis ~ Age + Number + Start,
   data=kyphosis)
plot(fit, main="Conditional Inference Tree for Kyphosis")

[Figure: conditional inference tree for Kyphosis]
# Conditional Inference Tree for Mileage
library(party)
fit2 <- ctree(Mileage ~ Price + Country + Reliability + Type,
   data=na.omit(cu.summary))
plot(fit2, main="Conditional Inference Tree for Mileage")

[Figure: conditional inference tree for Mileage]

Random Forests

Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying a case using each tree in this new "forest", and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification). Breiman and Cutler's random forest approach is implemented via the randomForest package.
Here is an example.
# Random Forest prediction of Kyphosis data
library(randomForest)
fit <- randomForest(Kyphosis ~ Age + Number + Start,
   data=kyphosis)
print(fit) # view results
importance(fit) # importance of each predictor
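The regression case works the same way, with predictions averaged across trees. Here is a hedged sketch reusing the cu.summary Mileage data from the regression tree example (the name fit2 is just illustrative; rows with missing values are dropped first, since randomForest( ) does not accept NAs in the predictors by default):

# Random Forest regression of Mileage (predictions averaged over trees)
library(randomForest)
fit2 <- randomForest(Mileage ~ Price + Country + Reliability + Type,
   data=na.omit(cu.summary), importance=TRUE)
print(fit2) # view results, including % variance explained
importance(fit2) # %IncMSE and IncNodePurity for each predictor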

For more details see the comprehensive Random Forest website.

Tuesday, April 15, 2014

How to Export C++ CPLEX Data to a .txt File

Suppose you solve a model with the CPLEX solver and obtain an optimal solution in the form of a three-dimensional matrix u[i][j][k], laid out as follows:

i.j   k1  k2  k3  k4  k5 ...
1.1    2   4   6   4   2 ...
1.2    3   2   1   4   5 ...
...
j.j    1   2   3   4   5 ...

The following code writes this three-dimensional matrix to an external .txt file:
#include <fstream> // for ofstream

ofstream ofs; // create an output stream
char filename1[128]; // buffer holding the output file path
sprintf(filename1, "C:/Users/allen/Dropbox/Large Scale Project/Data/Random Generate/vijk.txt"); // path of the .txt file to write
ofs.open(filename1, ofstream::app); // open the file in append mode
for (i = 0; i < nbnodes; i++) {
  for (j = 0; j < nbnodes; j++) {
    for (k = 0; k < nblines; k++) {
      ofs << u[i][j][k] << "\t"; // tab-separated values within a row
    }
    ofs << endl; // one line per (i, j) pair
  }
}


Define a Three-Dimensional Variable Matrix in C++ CPLEX

When I was first learning CPLEX I ran into all sorts of problems, such as this one: how to define a three-dimensional matrix. I searched all over Google without finding an answer, and the manual is even vaguer, so I will explain it in detail here.

First, define the types for your three-dimensional matrix. Since three dimensions are built from two, you also need a two-dimensional matrix type.
typedef IloArray<IloIntVarArray> IntVarMatrix2; // two-dimensional matrix of integer variables
typedef IloArray<IloArray<IloIntVarArray> > IntVarMatrix3; // three-dimensional matrix of integer variables

Then define your three-dimensional variable. Here mine has the form V[i][j][k], where i = 1..nbnodes, j = 1..nbnodes, and k = 1..nblines:

IntVarMatrix3 V(env, nbnodes); // length of the first dimension
for (i = 0; i < nbnodes; i++) {
  V[i] = IntVarMatrix2(env, nbnodes); // length of the second dimension
  for (j = 0; j < nbnodes; j++) {
    V[i][j] = IloIntVarArray(env, nblines, 0, RAND_MAX); // length of the third dimension, with bounds on each V[i][j][k]
  }
}

Each V[i][j][k] is meant to range from 0 to infinity. I originally used IloInfinity as the upper bound, but for some reason, after compiling, the solver treated the upper bound as a large negative number, so I switched to RAND_MAX.

Friday, April 11, 2014

Export a Matrix in Matlab to a .txt File

It is very simple, but still worth writing down. Enter this code in Matlab:

dlmwrite(filename, M, 'delimiter', '\t', 'newline','pc');
 

Here filename is the name of the .txt file to write; prefix it with a path to send the output to that directory. M is the name of the matrix in Matlab that you want to export.

'delimiter' sets the separator between values: '\t' puts a tab between every two values, while ' ' and ',' use a space or a comma instead.

To keep the exported data tidy, add the trailing 'newline','pc' pair, so that each row of the matrix is written on its own line (with PC-style line endings).