2015年9月21日星期一

Decision Tree using R -- 用R 实现决策树

Tree-Based Models

Recursive partitioning is a fundamental tool in data mining. It helps us explore the stucture of a set of data, while developing easy to visualize decision rules for predicting a categorical (classification tree) or continuous (regression tree) outcome. This section briefly describes CART modeling, conditional inference trees, and random forests.

CART Modeling via rpart

Classification and regression trees (as described by Brieman, Freidman, Olshen, and Stone) can be generated through the rpart package. Detailed information on rpart is available in An Introduction to Recursive Partitioning Using the RPART Routines. The general steps are provided below followed by two examples.

1. Grow the Tree

To grow a tree, use
rpart(formula, data=, method=,control=) where

*formula*	is in the format outcome ~ predictor1+predictor2+predictor3+ect.
data=	specifies the data frame
method=	"class" for a classification tree "anova" for a regression tree
control=	optional parameters for controlling tree growth. For example, control=rpart.control(minsplit=30, cp=0.001) requires that the minimum number of observations in a node be 30 before attempting a split and that a split must decrease the overall lack of fit by a factor of 0.001 (cost complexity factor) before being attempted.

2. Examine the results

The following functions help us to examine the results.

printcp(fit)	display cp table
plotcp(fit)	plot cross-validation results
rsq.rpart(fit)	plot approximate R-squared and relative error for different splits (2 plots). labels are only appropriate for the "anova" method.
print(fit)	print results
summary(fit)	detailed results including surrogate splits
plot(fit)	plot decision tree
text(fit)	label the decision tree plot
post(fit, file=)	create postscript plot of decision tree

In trees created by rpart( ), move to the LEFT branch when the stated condition is true (see the graphs below).

3. prune tree

Prune back the tree to avoid overfitting the data. Typically, you will want to select a tree size that minimizes the cross-validated error, the xerror column printed by printcp( ).
Prune the tree to the desired size using
prune(fit, cp= )
Specifically, use printcp( ) to examine the cross-validated error results, select the complexity parameter associated with minimum error, and place it into the prune( ) function. Alternatively, you can use the code fragment
fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"]
to automatically select the complexity parameter associated with the smallest cross-validated error. Thanks to HSAUR for this idea.

Classification Tree example

Let's use the data frame kyphosis to predict a type of deformation (kyphosis) after surgery, from age in months (Age), number of vertebrae involved (Number), and the highest vertebrae operated on (Start).

# Classification Tree with rpart

       library(rpart)

       

       # grow tree 

       fit <- rpart(Kyphosis ~ Age + Number + Start,

          method="class", data=kyphosis)

 

 printcp(fit) # display the results 

       plotcp(fit) # visualize cross-validation results 

        summary(fit) # detailed summary of splits

        

        # plot tree 

       plot(fit, uniform=TRUE, 

          main="Classification Tree for Kyphosis")

       text(fit, use.n=TRUE, all=TRUE, cex=.8)

       

       # create attractive postscript plot of tree 

       post(fit, file = "c:/tree.ps", 

          title = "Classification Tree for Kyphosis")

click to view


       # prune the tree 

       pfit<- prune(fit, cp=   fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])

       

       # plot the pruned tree 

       plot(pfit, uniform=TRUE, 

          main="Pruned Classification Tree for Kyphosis")

       text(pfit, use.n=TRUE, all=TRUE, cex=.8)

     post(pfit, file = "c:/ptree.ps", 

          title = "Pruned Classification Tree for Kyphosis")

Pruned Classification Tree in Postscript

click to view

Regression Tree example

In this example we will predict car mileage from price, country, reliability, and car type. The data frame is cu.summary.

# Regression Tree Example

       library(rpart)

       

 # grow tree 

 fit <- rpart(Mileage~Price + Country + Reliability + Type, 

    method="anova", data=cu.summary)

 

 printcp(fit) # display the results 

 plotcp(fit) # visualize cross-validation results
 

 summary(fit) # detailed summary of splits

 

 # create additional plots
 

        par(mfrow=c(1,2)) # two plots on one page
        

        rsq.rpart(fit) # visualize cross-validation results
         

 

 # plot tree 

 plot(fit, uniform=TRUE, 

          main="Regression Tree for Mileage ")

 text(fit, use.n=TRUE, all=TRUE, cex=.8)

 

 # create attractive postcript plot of tree 

 post(fit, file = "c:/tree2.ps", 

          title = "Regression Tree for Mileage ")

click to view

# prune the tree 

 pfit<- prune(fit, cp=0.01160389) # from cptable 

 

 # plot the pruned tree 

 plot(pfit, uniform=TRUE, 

          main="Pruned Regression Tree for Mileage")

 text(pfit, use.n=TRUE, all=TRUE, cex=.8)

 post(pfit, file = "c:/ptree2.ps", 

          title = "Pruned Regression Tree for Mileage")

It turns out that this produces the same tree as the original.

Conditional inference trees via party

The party package provides nonparametric regression trees for nominal, ordinal, numeric, censored, and multivariate responses. party: A laboratory for recursive partitioning, provides details.
You can create a regression or classification tree via the function
ctree(formula, data=)
The type of tree created will depend on the outcome variable (nominal factor, ordered factor, numeric, etc.). Tree growth is based on statistical stopping rules, so pruning should not be required.
The previous two examples are re-analyzed below.

# Conditional Inference Tree for Kyphosis

 library(party)

 fit <- ctree(Kyphosis ~ Age + Number + Start, 

    data=kyphosis)

 plot(fit, main="Conditional Inference Tree for Kyphosis")

Condiitional Inference Tree for Kyphosis

click to view

# Conditional Inference Tree for Mileage

       library(party)

 fit2 <- ctree(Mileage~Price + Country + Reliability + Type, 

    data=na.omit(cu.summary))

click to view

Random Forests

Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying a case using each tree in this new "forest", and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification). Breiman and Cutler's random forest approach is implimented via the randomForest package.
Here is an example.

# Random Forest prediction of Kyphosis data

       library(randomForest)

       fit <- randomForest(Kyphosis ~ Age + Number + Start,   data=kyphosis)

       print(fit) # view results
                 

     importance(fit) # importance of each predictor

For more details see the comprehensive Random Forest website.

2015年3月16日星期一

Latex 科技论文写作心得

说起Latex,对于一年前的我来说还是个非常恐怖的东西，因为当时对编程并不是很了解，感觉用word很直观的写作为什么非要用这个东西来编程？直到后来被逼无奈，只能边学边用，才发现对于数学公式的编辑真是WORD所不能比拟的。下面写一些我自己的心得以及上手攻略。

首先是安装。。我是用的Texdistribution+TEXMAKER来完成的，
Texdistribution可以在这里下载http://latex-project.org/ftp.html
Texmaker是文档编译器，个人觉得有很多现成的快捷键可以直接使用，而且在编辑同时还能显示效果，很好用。在这里下载http://www.xm1math.net/texmaker/

一些初级的应用我就不在这里一一详细说明了，只说一些我个人遇到过的问题

1.Reference或bibliography
对于reference，以前的流行趋势是在句子后面加[1]，代表引用的是第一个序号的文章，这样看起来比较费事，要一一对应。但是近些年很多期刊和文章开始要求使用作者+年份的格式。引用在latex里非常方便使用，所有引用的文章来源存储于一个.bib文件里，方便统一管理。以下是代码，说明赋予%后面
\documentclass[authoryear,preprint,review,12pt,round,comma]{elsarticle}

\usepackage[round]{natbib}

 asd askdhksjadh kjhkjhasd~\cite{}



% References
\section*{Reference}
\bibliographystyle{plainnat}
\bibliography{bibfile}

2014年11月18日星期二

Matlab中dataset的有关运算

Create a Dataset Array from a Numeric Array

This example shows how to create a dataset array from a numeric array existing in the MATLAB workspace.

Load sample data.

load('fisheriris')

Two variables load into the workspace: meas, a 150-by-4 numeric array, and species, a 150-by-1 cell array of strings containing species labels.

Create a dataset array.

Use mat2dataset to convert the numeric array, meas, into a dataset array.

ds = mat2dataset(meas);
ds(1:10,:)

ans = 

    meas1    meas2    meas3    meas4
    5.1      3.5      1.4      0.2  
    4.9        3      1.4      0.2  
    4.7      3.2      1.3      0.2  
    4.6      3.1      1.5      0.2  
      5      3.6      1.4      0.2  
    5.4      3.9      1.7      0.4  
    4.6      3.4      1.4      0.3  
      5      3.4      1.5      0.2  
    4.4      2.9      1.4      0.2  
    4.9      3.1      1.5      0.1

The array, meas, has four columns, so the dataset array, ds, has four variables. The default variable names are the array name, meas, with column numbers appended.
You can specify your own variable or observation names using the name-value pair arguments VarNames and ObsNames, respectively.
If you use dataset to convert a numeric array to a dataset array, by default, the resulting dataset array has one variable that is an array instead of separate variables for each column.

Examine the dataset array.

Return the size of the dataset array, ds.

size(ds)

ans =

   150     4

The dataset array, ds, is the same size as the numeric array, meas. Variable names and observation names do not factor into the size of a dataset array.

Explore dataset array metadata.

Return the metadata properties of the dataset array, ds.

ds.Properties

ans = 

       Description: ''
    VarDescription: {}
             Units: {}
          DimNames: {'Observations'  'Variables'}
          UserData: []
          ObsNames: {}
          VarNames: {'meas1'  'meas2'  'meas3'  'meas4'}

You can also access the properties individually. For example, you can retrieve the variable names using ds.Properties.VarNames.

Access data in a dataset array variable.

You can use variable names with dot indexing to access the data in a dataset array. For example, find the minimum value in the first variable, meas1.

min(ds.meas1)

ans =

    4.3000

Change variable names.

The four variables in ds are actually measurements of sepal length, sepal width, petal length, and petal width. Modify the variable names to be more descriptive.

ds.Properties.VarNames = {'SLength','SWidth','PLength','PWidth'};

Add description.

you can add a description for the dataset array.

ds.Properties.Description = 'Fisher iris data';
ds.Properties

ans = 

       Description: 'Fisher iris data'
    VarDescription: {}
             Units: {}
          DimNames: {'Observations'  'Variables'}
          UserData: []
          ObsNames: {}
          VarNames: {'SLength'  'SWidth'  'PLength'  'PWidth'}

The dataset array properties are updated with the new variable names and description.

Add a variable to the dataset array.

The variable species is a cell array of strings containing species labels. Add species to the dataset array, ds, as a nominal array named Species. Display the first five observations in the dataset array.

ds.Species = nominal(species);
ds(1:5,:)

ans = 

    SLength    SWidth    PLength    PWidth    Species
    5.1        3.5       1.4        0.2       setosa 
    4.9          3       1.4        0.2       setosa 
    4.7        3.2       1.3        0.2       setosa 
    4.6        3.1       1.5        0.2       setosa 
      5        3.6       1.4        0.2       setosa

The dataset array, ds, now has the fifth variable, Species.

Create Dataset Array from Heterogeneous Workspace Variables

This example shows how to create a dataset array from heterogeneous variables existing in the MATLAB workspace.

Load sample data.

load('carsmall')

Create a dataset array.

Create a dataset array from a subset of the workspace variables.

ds = dataset(Origin,Acceleration,Cylinders,MPG);
ds.Properties.VarNames(:)

ans = 

    'Origin'
    'Acceleration'
    'Cylinders'
    'MPG'

When creating the dataset array, you do not need to enter variable names. dataset automatically uses the name of each workspace variable.
Notice that the dataset array, ds, contains a collection of variables with heterogeneous data types. Origin is a character array, and the other variables are numeric.

Examine a dataset array.

Display the first five observations in the dataset array.

ds(1:5,:)

ans = 

    Origin     Acceleration    Cylinders    MPG
    USA          12            8            18 
    USA        11.5            8            15 
    USA          11            8            18 
    USA          12            8            16 
    USA        10.5            8            17

Apply a function to a dataset array.

Use datasetfun to return the data type of each variable in ds.

varclass = datasetfun(@class,ds,'UniformOutput',false);
varclass(:)

ans = 

    'char'
    'double'
    'double'
    'double'

You can get additional information about the variables using summary(ds).

Modify a dataset array.

Cylinders is a numeric variable that has values 4, 6, and 8 for the number of cylinders. Convert Cylinders to a nominal array with levels four, six, and eight.
Display the country of origin and number of cylinders for the first 15 cars.

ds.Cylinders = nominal(ds.Cylinders,{'four','six','eight'});
ds(1:15,{'Origin','Cylinders'})

ans = 

    Origin     Cylinders
    USA        eight    
    USA        eight    
    USA        eight    
    USA        eight    
    USA        eight    
    USA        eight    
    USA        eight    
    USA        eight    
    USA        eight    
    USA        eight    
    France     four     
    USA        eight    
    USA        eight    
    USA        eight    
    USA        eight

The variable Cylinders has a new data type.

Create a Dataset Array from a Tab-Delimited Text File

This example shows how to create a dataset array from the contents of a tab-delimited text file.

Create a dataset array using default settings.

Navigate to the folder containing sample data. Import the text file hospitalSmall.txt as a dataset array using the default settings.

cd(matlabroot)
cd('help/toolbox/stats/examples')
ds = dataset('File','hospitalSmall.txt')

ds = 

    name              sex        age    wgt    smoke
    'SMITH'           'm'        38     176    1    
    'JOHNSON'         'm'        43     163    0    
    'WILLIAMS'        'f'        38     131    0    
    'JONES'           'f'        40     133    0    
    'BROWN'           'f'        49     119    0    
    'DAVIS'           'f'        46     142    0    
    'MILLER'          'f'        33     142    1    
    'WILSON'          'm'        40     180    0    
    'MOORE'           'm'        28     183    0    
    'TAYLOR'          'f'        31     132    0    
    'ANDERSON'        'f'        45     128    0    
    'THOMAS'          'f'        42     137    0    
    'JACKSON'         'm'        25     174    0    
    'WHITE'           'm'        39     202    1

By default, dataset uses the first row of the text file for variable names. If the first row does not contain variable names, you can specify the optional name-value pair argument 'ReadVarNames',false to change the default behavior.
The dataset array contains heterogeneous variables. The variables id, name, and sex are cell arrays of strings, and the other variables are numeric.

Summarize the dataset array.

You can see the data type and other descriptive statistics for each variable by using summary to summarize the dataset array.

summary(ds)

name: [14x1 cell string]

sex: [14x1 cell string]

age: [14x1 double]

    min    1st quartile    median    3rd quartile    max
    25     33              39.5      43              49 

wgt: [14x1 double]

    min    1st quartile    median    3rd quartile    max
    119    132             142       176             202

smoke: [14x1 double]

    min    1st quartile    median    3rd quartile    max
    0      0               0         0               1

Import observation names.

Import the text file again, this time specifying that the first column contains observation names.

ds = dataset('File','hospitalSmall.txt','ReadObsNames',true)

ds = 

                sex        age    wgt    smoke
    SMITH       'm'        38     176    1    
    JOHNSON     'm'        43     163    0    
    WILLIAMS    'f'        38     131    0    
    JONES       'f'        40     133    0    
    BROWN       'f'        49     119    0    
    DAVIS       'f'        46     142    0    
    MILLER      'f'        33     142    1    
    WILSON      'm'        40     180    0    
    MOORE       'm'        28     183    0    
    TAYLOR      'f'        31     132    0    
    ANDERSON    'f'        45     128    0    
    THOMAS      'f'        42     137    0    
    JACKSON     'm'        25     174    0    
    WHITE       'm'        39     202    1

The elements of the first column in the text file, last names, are now observation names. Observation names and row names are dataset array properties. You can always add or change the observation names of an existing dataset array by modifying the property ObsNames.

Change dataset array properties.

By default, the DimNames property of the dataset array has name as the descriptor of the observation (row) dimension. dataset got this name from the first row of the first column in the text file.
Change the first element of DimNames to LastName.

ds.Properties.DimNames{1} = 'LastName';
ds.Properties

ans = 

       Description: ''
    VarDescription: {}
             Units: {}
          DimNames: {'LastName'  'Variables'}
          UserData: []
          ObsNames: {14x1 cell}
          VarNames: {'sex'  'age'  'wgt'  'smoke'}

Index into dataset array.

You can use observation names to index into a dataset array. For example, return the data for the patient with last name BROWN.

ds('BROWN',:)

ans = 

             sex        age    wgt    smoke
    BROWN    'f'        49     119    0

Note that observation names must be unique.

Create a Dataset Array from a Comma-Separated Text File

This example shows how to create a dataset array from the contents of a comma-separated text file.

Create a dataset array.

Navigate to the folder containing sample data. Import the file hospitalSmall.csv as a dataset array, specifying the comma-delimited format.

cd(matlabroot)
cd('help/toolbox/stats/examples')
ds = dataset('File','hospitalSmall.csv','Delimiter',',')

ds = 

    id               name              sex        age    wgt    smoke
    'YPL-320'        'SMITH'           'm'        38     176    1    
    'GLI-532'        'JOHNSON'         'm'        43     163    0    
    'PNI-258'        'WILLIAMS'        'f'        38     131    0    
    'MIJ-579'        'JONES'           'f'        40     133    0    
    'XLK-030'        'BROWN'           'f'        49     119    0    
    'TFP-518'        'DAVIS'           'f'        46     142    0    
    'LPD-746'        'MILLER'          'f'        33     142    1    
    'ATA-945'        'WILSON'          'm'        40     180    0    
    'VNL-702'        'MOORE'           'm'        28     183    0    
    'LQW-768'        'TAYLOR'          'f'        31     132    0    
    'QFY-472'        'ANDERSON'        'f'        45     128    0    
    'UJG-627'        'THOMAS'          'f'        42     137    0    
    'XUE-826'        'JACKSON'         'm'        25     174    0    
    'TRW-072'        'WHITE'           'm'        39     202    1

By default, dataset uses the first row in the text file as variable names.

Add observation names.

Use the unique identifiers in the variable id as observation names. Then, delete the variable id from the dataset array.

ds.Properties.ObsNames = ds.id;
ds.id = []

ds = 

               name              sex        age    wgt    smoke
    YPL-320    'SMITH'           'm'        38     176    1    
    GLI-532    'JOHNSON'         'm'        43     163    0    
    PNI-258    'WILLIAMS'        'f'        38     131    0    
    MIJ-579    'JONES'           'f'        40     133    0    
    XLK-030    'BROWN'           'f'        49     119    0    
    TFP-518    'DAVIS'           'f'        46     142    0    
    LPD-746    'MILLER'          'f'        33     142    1    
    ATA-945    'WILSON'          'm'        40     180    0    
    VNL-702    'MOORE'           'm'        28     183    0    
    LQW-768    'TAYLOR'          'f'        31     132    0    
    QFY-472    'ANDERSON'        'f'        45     128    0    
    UJG-627    'THOMAS'          'f'        42     137    0    
    XUE-826    'JACKSON'         'm'        25     174    0    
    TRW-072    'WHITE'           'm'        39     202    1

Delete observations.

Delete any patients with the last name BROWN. You can use strcmp to match the string 'BROWN' with the elements of the variable containing last names, name.

toDelete = strcmp(ds.name,'BROWN');
ds(toDelete,:) = []

ds = 

               name              sex        age    wgt    smoke
    YPL-320    'SMITH'           'm'        38     176    1    
    GLI-532    'JOHNSON'         'm'        43     163    0    
    PNI-258    'WILLIAMS'        'f'        38     131    0    
    MIJ-579    'JONES'           'f'        40     133    0    
    TFP-518    'DAVIS'           'f'        46     142    0    
    LPD-746    'MILLER'          'f'        33     142    1    
    ATA-945    'WILSON'          'm'        40     180    0    
    VNL-702    'MOORE'           'm'        28     183    0    
    LQW-768    'TAYLOR'          'f'        31     132    0    
    QFY-472    'ANDERSON'        'f'        45     128    0    
    UJG-627    'THOMAS'          'f'        42     137    0    
    XUE-826    'JACKSON'         'm'        25     174    0    
    TRW-072    'WHITE'           'm'        39     202    1

One patient having last name BROWN is deleted from the dataset array.

Return size of dataset array.

The array now has 13 observations.

size(ds)

ans =

    13     5

Note that the row and column corresponding to variable and observation names, respectively, are not included in the size of a dataset array.

Create a Dataset Array from an Excel File

This example shows how to create a dataset array from the contents of an Excel^® spreadsheet file.

Create a dataset array.

Navigate to the folder containing sample data. Import the data from the first worksheet in the file hospitalSmall.xlsx, specifying that the data file is an Excel spreadsheet.

cd(matlabroot)
cd('help/toolbox/stats/examples')
ds = dataset('XLSFile','hospitalSmall.xlsx')

ds = 

    id               name              sex        age    wgt    smoke
    'YPL-320'        'SMITH'           'm'        38     176    1    
    'GLI-532'        'JOHNSON'         'm'        43     163    0    
    'PNI-258'        'WILLIAMS'        'f'        38     131    0    
    'MIJ-579'        'JONES'           'f'        40     133    0    
    'XLK-030'        'BROWN'           'f'        49     119    0    
    'TFP-518'        'DAVIS'           'f'        46     142    0    
    'LPD-746'        'MILLER'          'f'        33     142    1    
    'ATA-945'        'WILSON'          'm'        40     180    0    
    'VNL-702'        'MOORE'           'm'        28     183    0    
    'LQW-768'        'TAYLOR'          'f'        31     132    0    
    'QFY-472'        'ANDERSON'        'f'        45     128    0    
    'UJG-627'        'THOMAS'          'f'        42     137    0    
    'XUE-826'        'JACKSON'         'm'        25     174    0    
    'TRW-072'        'WHITE'           'm'        39     202    1

By default, dataset creates variable names using the contents of the first row in the spreadsheet.

Specify which worksheet to import.

Import the data from the second worksheet into a new dataset array.

ds2 = dataset('XLSFile','hospitalSmall.xlsx','Sheet',2)

ds2 = 

    id               name              sex        age    wgt    smoke
    'TRW-072'        'WHITE'           'm'        39     202    1    
    'ELG-976'        'HARRIS'          'f'        36     129    0    
    'KOQ-996'        'MARTIN'          'm'        48     181    1    
    'YUZ-646'        'THOMPSON'        'm'        32     191    1    
    'XBR-291'        'GARCIA'          'f'        27     131    1    
    'KPW-846'        'MARTINEZ'        'm'        37     179    0    
    'XBA-581'        'ROBINSON'        'm'        50     172    0    
    'BKD-785'        'CLARK'           'f'        48     133    0

Load sample data.

Navigate to the folder containing sample data. Import the data from the first worksheet in hospitalSmall.xlsx into a dataset array.
cd(matlabroot) cd('help/toolbox/stats/examples') ds = dataset('XLSFile','hospitalSmall.xlsx'); size(ds)ans = 14 6The dataset array, ds, has 14 observations (rows) and 6 variables (columns).

Add observations by concatenation.

The second worksheet in hospitalSmall.xlsx has additional patient data. Append the observations in this spreadsheet to the end of ds.
ds2 = dataset('XLSFile','hospitalSmall.xlsx','Sheet',2); dsNew = [ds;ds2]; size(dsNew)ans = 22 6The dataset array dsNew has 22 observations. In order to vertically concatenate two dataset arrays, both arrays must have the same number of variables, with the same variable names.

Add observations from a cell array.

If you want to append new observations stored in a cell array, first convert the cell array to a dataset array, and then concatenate the dataset arrays.
cellObs = {'id','name','sex','age','wgt','smoke'; 'YQR-965','BAKER','M',36,160,0; 'LFG-497','WALL' ,'F',28,125,1; 'KSD-003','REED' ,'M',32,187,0}; dsNew = [dsNew;cell2dataset(cellObs)]; size(dsNew)ans = 25 6

Add observations from a structure.

You can also append new observations stored in a structure. Convert the structure to a dataset array, and then concatenate the dataset arrays.
structObs(1,1).id = 'GHK-842'; structObs(1,1).name = 'GEORGE'; structObs(1,1).sex = 'M'; structObs(1,1).age = 45; structObs(1,1).wgt = 182; structObs(1,1).smoke = 1; structObs(2,1).id = 'QRH-308'; structObs(2,1).name = 'BAILEY'; structObs(2,1).sex = 'F'; structObs(2,1).age = 29; structObs(2,1).wgt = 120; structObs(2,1).smoke = 0; dsNew = [dsNew;struct2dataset(structObs)]; size(dsNew)ans = 27 6

Delete duplicate observations.

Use unique to delete any observations in a dataset array that are duplicated.
dsNew = unique(dsNew); size(dsNew)ans = 21 6 One duplicated observation is deleted.

Delete observations by observation number.

Delete observations 18, 20, and 21 from the dataset array.
dsNew([18,20,21],:) = []; size(dsNew)ans = 18 6The dataset array has only 18 observations now.

Delete observations by observation name.

First, specify the variable of identifiers, id, as observation names. Then, delete the variable id from dsNew. You can use the observation name to index observations.
dsNew.Properties.ObsNames = dsNew.id; dsNew.id = []; dsNew('KOQ-996',:) = []; size(dsNew)ans = 17 5The dataset array now has one less observation and one less variable.

Search for observations to delete.

You can also search for observations in the dataset array. For example, delete observations for any patients with the last name WILLIAMS.
toDelete = strcmp(dsNew.name,'WILLIAMS'); dsNew(toDelete,:) = []; size(dsNew)ans = 16 5The dataset array now has one less observation.

Load sample data.

Navigate to the folder containing sample data.

cd(matlabroot)
cd('help/toolbox/stats/examples')

Import the data from the first worksheet in hospitalSmall.xlsx into a dataset array.

ds = dataset('XLSFile','hospitalSmall.xlsx');
size(ds)

ans =

    14     6

The dataset array, ds, has 14 observations (rows) and 6 variables (columns).

Add variables by concatenating dataset arrays.

The worksheet Heights in hospitalSmall.xlsx has heights for the patients on the first worksheet. Concatenate the data in this spreadsheet with ds.

ds2 = dataset('XLSFile','hospitalSmall.xlsx','Sheet','Heights');
ds = [ds ds2];
size(ds)

ans =

    14     7

The dataset array now has seven variables. You can only horizontally concatenate dataset arrays with observations in the same position, or with the same observation names.

ds.Properties.VarNames{end}

ans =

hgt

The name of the last variable in ds is hgt, which dataset read from the first row of the imported spreadsheet.

Delete variables by variable name.

First, specify the unique identifiers in the variable id as observation names. Then, delete the variable id from the dataset array.

ds.Properties.ObsNames = ds.id;
ds.id = [];
size(ds)

ans =

    14     6

The dataset array now has six variables. List the variable names.

ds.Properties.VarNames(:)

ans = 

    'name'
    'sex'
    'age'
    'wgt'
    'smoke'
    'hgt'

There is no longer a variable called id.

Add a new variable by name.

Add a new variable, bmi—which contains the body mass index (BMI) for each patient—to the dataset array. BMI is a function of height and weight. Display the last name, gender, and BMI for each patient.

ds.bmi = ds.wgt*703./ds.hgt.^2;
ds(:,{'name','sex','bmi'})

ans = 

               name              sex        bmi   
    YPL-320    'SMITH'           'm'        24.544
    GLI-532    'JOHNSON'         'm'        24.068
    PNI-258    'WILLIAMS'        'f'        23.958
    MIJ-579    'JONES'           'f'        25.127
    XLK-030    'BROWN'           'f'        21.078
    TFP-518    'DAVIS'           'f'        27.729
    LPD-746    'MILLER'          'f'        26.828
    ATA-945    'WILSON'          'm'         24.41
    VNL-702    'MOORE'           'm'        27.822
    LQW-768    'TAYLOR'          'f'        22.655
    QFY-472    'ANDERSON'        'f'        23.409
    UJG-627    'THOMAS'          'f'        25.883
    XUE-826    'JACKSON'         'm'        24.265
    TRW-072    'WHITE'           'm'        29.827

The operators ./ and .^ in the calculation of BMI indicate element-wise division and exponentiation, respectively.

Delete variables by variable number.

Delete the variable wgt, the fourth variable in the dataset array.

ds(:,4) = [];
ds.Properties.VarNames(:)

ans = 

    'name'
    'sex'
    'age'
    'smoke'
    'hgt'
    'bmi'

The variable wgt is deleted from the dataset array.

Access variables by name.

You can access variable data, or select a subset of variables, by using variable (column) names and dot indexing. Load a sample dataset array. Display the names of the variables in hospital.

load('hospital')
hospital.Properties.VarNames(:)

ans = 

    'LastName'
    'Sex'
    'Age'
    'Weight'
    'Smoker'
    'BloodPressure'
    'Trials'

The dataset array has 7 variables (columns) and 100 observations (rows). You can double-click hospital in the Workspace window to view the dataset array in the Variables editor.

Plot histogram.

Plot a histogram of the data in the variable Weight.

figure()
hist(hospital.Weight)

The histogram shows that the weight distribution is bimodal.

Plot data grouped by category.

Draw box plots of Weight grouped by the values in Sex (Male and Female). That is, use the variable Sex as a grouping variable.

figure()
boxplot(hospital.Weight,hospital.Sex)

The box plot suggests that gender accounts for the bimodality in weight.

Select a subset of variables.

Create a new dataset array with only the variables LastName, Sex, and Weight. You can access the variables by name or column number.

ds1 = hospital(:,{'LastName','Sex','Weight'});
ds2 = hospital(:,[1,2,4]);

The dataset arrays ds1 and ds2 are equivalent. Use parentheses ( ) when indexing dataset arrays to preserve the data type; that is, to create a dataset array from a subset of a dataset array. You can also use the Variables editor to create a new dataset array from a subset of variables and observations.

Convert the variable data type.

Convert the data type of the variable Smoker from logical to nominal with labels No and Yes.

hospital.Smoker = nominal(hospital.Smoker,{'No','Yes'});
class(hospital.Smoker)

ans =

nominal

Explore data.

Display the first 10 elements of Smoker.

hospital.Smoker(1:10)

ans = 

     Yes 
     No 
     No 
     No 
     No 
     No 
     Yes 
     No 
     No 
     No

If you want to change the level labels in a nominal array, use setlabels.

Add variables.

The variable BloodPressure is a 100-by-2 array. The first column corresponds to systolic blood pressure, and the second column to diastolic blood pressure. Separate this array into two new variables, SysPressure and DiaPressure.

hospital.SysPressure = hospital.BloodPressure(:,1);
hospital.DiaPressure = hospital.BloodPressure(:,2);
hospital.Properties.VarNames(:)

ans = 

    'LastName'
    'Sex'
    'Age'
    'Weight'
    'Smoker'
    'BloodPressure'
    'Trials'
    'SysPressure'
    'DiaPressure'

The dataset array, hospital, has two new variables.

Search for variables by name.

Use regexp to find variables in hospital with the string 'Pressure' in their name. Create a new dataset array containing only these variables.

bp = regexp(hospital.Properties.VarNames,'Pressure');
bpIdx = cellfun(@isempty,bp);
bpData = hospital(:,~bpIdx);
bpData.Properties.VarNames(:)

ans = 

    'BloodPressure'
    'SysPressure'
    'DiaPressure'

The new dataset array, bpData, contains only the blood pressure variables.

Delete variables.

Delete the variable BloodPressure from the dataset array, hospital.

hospital.BloodPressure = [];
hospital.Properties.VarNames(:)

ans = 

    'LastName'
    'Sex'
    'Age'
    'Weight'
    'Smoker'
    'Trials'
    'SysPressure'
    'DiaPressure'

The variable BloodPressure is no longer in the dataset array.

Load sample data.

Load the sample dataset array, hospital. Dataset arrays can have observation (row) names. This array has observation names corresponding to unique patient identifiers.

load('hospital')
hospital.Properties.ObsNames(1:10)

ans = 

    'YPL-320'
    'GLI-532'
    'PNI-258'
    'MIJ-579'
    'XLK-030'
    'TFP-518'
    'LPD-746'
    'ATA-945'
    'VNL-702'
    'LQW-768'

These are the first 10 observation names.

Index an observation by name.

You can use the observation names to index into the dataset array. For example, extract the last name, sex, and age for the patient with identifier XLK-030.

hospital('XLK-030',{'LastName','Sex','Age'})

ans = 

               LastName       Sex       Age
    XLK-030    'BROWN'        Female    49

Index a subset of observations by number.

Create a new dataset array containing the first 50 patients.

ds50 = hospital(1:50,:);
size(ds50)

ans =

    50     7

Search observations using a logical condition.

Create a new dataset array containing only male patients. To find the male patients, use a logical condition to search the variable containing gender information.

dsMale = hospital(hospital.Sex=='Male',:);
dsMale(1:10,{'LastName','Sex'})

ans = 

               LastName          Sex 
    YPL-320    'SMITH'           Male
    GLI-532    'JOHNSON'         Male
    ATA-945    'WILSON'          Male
    VNL-702    'MOORE'           Male
    XUE-826    'JACKSON'         Male
    TRW-072    'WHITE'           Male
    KOQ-996    'MARTIN'          Male
    YUZ-646    'THOMPSON'        Male
    KPW-846    'MARTINEZ'        Male
    XBA-581    'ROBINSON'        Male

Search observations using multiple conditions.

You can use multiple conditions to search the dataset array. For example, create a new dataset array containing only female patients older than 40.

dsFemale = hospital(hospital.Sex=='Female' & hospital.Age > 40,:);
dsFemale(1:10,{'LastName','Sex','Age'})

ans = 

               LastName          Sex       Age
    XLK-030    'BROWN'           Female    49 
    TFP-518    'DAVIS'           Female    46 
    QFY-472    'ANDERSON'        Female    45 
    UJG-627    'THOMAS'          Female    42 
    BKD-785    'CLARK'           Female    48 
    VWL-936    'LEWIS'           Female    41 
    AAX-056    'LEE'             Female    44 
    AFK-336    'WRIGHT'          Female    45 
    KKL-155    'ADAMS'           Female    48 
    RBA-579    'SANCHEZ'         Female    44

Select a random subset of observations.

Create a new dataset array containing a random subset of 20 patients from the dataset array hospital.

rng('default') % For reproducibility
dsRandom = hospital(randsample(length(hospital),20),:);
dsRandom.Properties.ObsNames

ans = 

    'DAU-529'
    'AGR-528'
    'RBO-332'
    'QOO-305'
    'RVS-253'
    'QEQ-082'
    'EHE-616'
    'HVR-372'
    'KOQ-996'
    'REV-997'
    'PUE-347'
    'LQW-768'
    'YLN-495'
    'HJQ-495'
    'ELG-976'
    'XUE-826'
    'MEZ-469'
    'UDS-151'
    'MIJ-579'
    'DGC-290'

Delete observations by name.

Delete the data for the patient with observation name HVR-372.

hospital('HVR-372',:) = [];
size(hospital)

ans =

    99     7

The dataset array has one less observation.

Sort observations in ascending order.

Load the sample dataset array, hospital. Sort the observations by the values in Age, in ascending order.

load('hospital')
dsAgeUp = sortrows(hospital,'Age');
dsAgeUp(1:10,{'LastName','Age'})

ans = 

               LastName           Age
    XUE-826    'JACKSON'          25 
    FZR-250    'HALL'             25 
    PUE-347    'YOUNG'            25 
    LIM-480    'HILL'             25 
    SCQ-914    'JAMES'            25 
    REV-997    'ALEXANDER'        25 
    XBR-291    'GARCIA'           27 
    VNL-702    'MOORE'            28 
    DTT-578    'WALKER'           28 
    XAX-646    'COOPER'           28

The youngest patients are age 25.

Sort observations in descending order.

Sort the observations by Age in descending order.

dsAgeDown = sortrows(hospital,'Age','descend');
dsAgeDown(1:10,{'LastName','Age'})

ans = 

               LastName          Age
    XBA-581    'ROBINSON'        50 
    DAU-529    'REED'            50 
    XLK-030    'BROWN'           49 
    FLJ-908    'STEWART'         49 
    GGU-691    'HUGHES'          49 
    MEZ-469    'GRIFFIN'         49 
    KOQ-996    'MARTIN'          48 
    BKD-785    'CLARK'           48 
    KKL-155    'ADAMS'           48 
    NSK-403    'RAMIREZ'         48

The oldest patients are age 50.

Sort observations by the values of two variables.

Sort the observations in hospital by Age, and then by LastName.

dsName = sortrows(hospital,{'Age','LastName'});
dsName(1:10,{'LastName','Age'})

ans = 

               LastName           Age
    REV-997    'ALEXANDER'        25 
    FZR-250    'HALL'             25 
    LIM-480    'HILL'             25 
    XUE-826    'JACKSON'          25 
    SCQ-914    'JAMES'            25 
    PUE-347    'YOUNG'            25 
    XBR-291    'GARCIA'           27 
    XAX-646    'COOPER'           28 
    QEQ-082    'COX'              28 
    NSU-424    'JENKINS'          28

Now the names are sorted alphabetically within increasing age groups.

Sort observations in mixed order.

Sort the observations in hospital by Age in an increasing order, and then by Weight in a decreasing order.

dsWeight = sortrows(hospital,{'Age','Weight'},{'ascend','descend'});
dsWeight(1:10,{'LastName','Age','Weight'})

ans = 

               LastName           Age    Weight
    FZR-250    'HALL'             25     189   
    SCQ-914    'JAMES'            25     186   
    XUE-826    'JACKSON'          25     174   
    REV-997    'ALEXANDER'        25     171   
    LIM-480    'HILL'             25     138   
    PUE-347    'YOUNG'            25     114   
    XBR-291    'GARCIA'           27     131   
    NSU-424    'JENKINS'          28     189   
    VNL-702    'MOORE'            28     183   
    XAX-646    'COOPER'           28     127

This shows that the maximum weight among patients that are age 25 is 189 lbs.

Sort observations by observation name.

Sort the observations in hospital by the observation names.

dsObs = sortrows(hospital,'obsnames');
dsObs(1:10,{'LastName','Age'})

ans = 

               LastName         Age
    AAX-056    'LEE'            44 
    AFB-271    'PEREZ'          44 
    AFK-336    'WRIGHT'         45 
    AGR-528    'SIMMONS'        45 
    ATA-945    'WILSON'         40 
    BEZ-311    'DIAZ'           45 
    BKD-785    'CLARK'          48 
    DAU-529    'REED'           50 
    DGC-290    'BUTLER'         38 
    DTT-578    'WALKER'         28

The observations are sorted by observation name in ascending alphabetical order.

Load sample data.

Navigate to a folder containing sample data. Import the data from the first worksheet in hospitalSmall.xlsx into a dataset array, then keep only a few of the variables.

cd(matlabroot)
cd('help/toolbox/stats/examples')
ds1 = dataset('XLSFile','hospitalSmall.xlsx');
ds1 = ds1(:,{'id','name','sex','age'})

ds1 = 

    id               name              sex        age
    'YPL-320'        'SMITH'           'm'        38 
    'GLI-532'        'JOHNSON'         'm'        43 
    'PNI-258'        'WILLIAMS'        'f'        38 
    'MIJ-579'        'JONES'           'f'        40 
    'XLK-030'        'BROWN'           'f'        49 
    'TFP-518'        'DAVIS'           'f'        46 
    'LPD-746'        'MILLER'          'f'        33 
    'ATA-945'        'WILSON'          'm'        40 
    'VNL-702'        'MOORE'           'm'        28 
    'LQW-768'        'TAYLOR'          'f'        31 
    'QFY-472'        'ANDERSON'        'f'        45 
    'UJG-627'        'THOMAS'          'f'        42 
    'XUE-826'        'JACKSON'         'm'        25 
    'TRW-072'        'WHITE'           'm'        39

The dataset array, ds1, has 14 observations (rows) and 4 variables (columns).
Import the data from the worksheet Heights2 in hospitalSmall.xlsx.

ds2 = dataset('XLSFile','hospitalSmall.xlsx','Sheet','Heights2')

ds2 = 

    id               hgt
    'LPD-746'        61 
    'PNI-258'        62 
    'XUE-826'        71 
    'ATA-945'        72 
    'XLK-030'        63

ds2 has height measurements for a subset of five individuals from the first dataset array, ds1.

Merge only the matching subset of observations.

Use join to merge the two dataset arrays, ds1 and ds2, keeping only the subset of observations that are in ds2.

JoinSmall = join(ds2,ds1)

JoinSmall = 

    id               hgt    name              sex        age
    'LPD-746'        61     'MILLER'          'f'        33 
    'PNI-258'        62     'WILLIAMS'        'f'        38 
    'XUE-826'        71     'JACKSON'         'm'        25 
    'ATA-945'        72     'WILSON'          'm'        40 
    'XLK-030'        63     'BROWN'           'f'        49

In JoinSmall, the variable id only appears once. This is because it is the key variable—the variable that links observations between the two dataset arrays—and has the same variable name in both ds1 and ds2.

Include incomplete observations in the merge.

Merge ds1 and ds2 keeping all observations in the larger ds1.

joinAll = join(ds2,ds1,'type','rightouter','mergekeys',true)

joinAll = 

    id               hgt    name              sex        age
    'ATA-945'         72    'WILSON'          'm'        40 
    'GLI-532'        NaN    'JOHNSON'         'm'        43 
    'LPD-746'         61    'MILLER'          'f'        33 
    'LQW-768'        NaN    'TAYLOR'          'f'        31 
    'MIJ-579'        NaN    'JONES'           'f'        40 
    'PNI-258'         62    'WILLIAMS'        'f'        38 
    'QFY-472'        NaN    'ANDERSON'        'f'        45 
    'TFP-518'        NaN    'DAVIS'           'f'        46 
    'TRW-072'        NaN    'WHITE'           'm'        39 
    'UJG-627'        NaN    'THOMAS'          'f'        42 
    'VNL-702'        NaN    'MOORE'           'm'        28 
    'XLK-030'         63    'BROWN'           'f'        49 
    'XUE-826'         71    'JACKSON'         'm'        25 
    'YPL-320'        NaN    'SMITH'           'm'        38

Each observation in ds1 without corresponding height measurements in ds2 has height value NaN. Also, because there is no id value in ds2 for each observation in ds1, you need to merge the keys using the option 'MergeKeys',true. This merges the key variable, id.

Merge dataset arrays with different key variable names.

When using join, it is not necessary for the key variable to have the same name in the dataset arrays to be merged. Import the data from the worksheet named Heights3 in hospitalSmall.xlsx.

ds3 = dataset('XLSFile','hospitalSmall.xlsx','Sheet','Heights3')

ds3 = 

    identifier       hgt
    'GLI-532'        69 
    'QFY-472'        62 
    'MIJ-579'        61 
    'VNL-702'        68 
    'XLK-030'        63 
    'LPD-746'        61 
    'TFP-518'        60 
    'YPL-320'        71 
    'ATA-945'        72 
    'LQW-768'        64 
    'PNI-258'        62 
    'UJG-627'        61 
    'XUE-826'        71 
    'TRW-072'        69

ds3 has height measurements for each observation in ds1. This dataset array has the same patient identifiers as ds1, but they are under the variable name identifier, instead of id (and in a different order).

Specify key variable.

You can easily change the variable name of the key variable in ds3 by setting d3.Properties.VarNames or using the Variables editor, but it is not required to perform a merge. Instead, you can specify the name of the key variable in each dataset array using LeftKeys and RightKeys.

joinDiff = join(ds3,ds1,'LeftKeys','identifier','RightKeys','id')

joinDiff = 

    identifier       hgt    name              sex        age
    'GLI-532'        69     'JOHNSON'         'm'        43 
    'QFY-472'        62     'ANDERSON'        'f'        45 
    'MIJ-579'        61     'JONES'           'f'        40 
    'VNL-702'        68     'MOORE'           'm'        28 
    'XLK-030'        63     'BROWN'           'f'        49 
    'LPD-746'        61     'MILLER'          'f'        33 
    'TFP-518'        60     'DAVIS'           'f'        46 
    'YPL-320'        71     'SMITH'           'm'        38 
    'ATA-945'        72     'WILSON'          'm'        40 
    'LQW-768'        64     'TAYLOR'          'f'        31 
    'PNI-258'        62     'WILLIAMS'        'f'        38 
    'UJG-627'        61     'THOMAS'          'f'        42 
    'XUE-826'        71     'JACKSON'         'm'        25 
    'TRW-072'        69     'WHITE'           'm'        39

The merged dataset array, joinDiff, has the same key variable order and name as the first dataset array input to join, ds3.

Load sample data.

Navigate to the folder containing sample data. Import the data from the comma-separated text file testScores.csv.

cd(matlabroot)
cd('help/toolbox/stats/examples')
ds = dataset('File','testScores.csv','Delimiter',',')

ds = 

    LastName          Sex             Test1    Test2    Test3    Test4
    'HOWARD'          'male'          90       87       93       92   
    'WARD'            'male'          87       85       83       90   
    'TORRES'          'male'          86       85       88       86   
    'PETERSON'        'female'        75       80       72       77   
    'GRAY'            'female'        89       86       87       90   
    'RAMIREZ'         'female'        96       92       98       95   
    'JAMES'           'male'          78       75       77       77   
    'WATSON'          'female'        91       94       92       90   
    'BROOKS'          'female'        86       83       85       89   
    'KELLY'           'male'          79       76       82       80

Each of the 10 students has 4 test scores, displayed here in wide format.

Perform calculations on dataset array.

With the data in this format, you can, for example, calculate the average test score for each student. The test scores are in columns 3 to 6.

ds.TestAve = mean(double(ds(:,3:6)),2);
ds(:,{'LastName','Sex','TestAve'})

ans = 

    LastName          Sex             TestAve
    'HOWARD'          'male'           90.5  
    'WARD'            'male'          86.25  
    'TORRES'          'male'          86.25  
    'PETERSON'        'female'           76  
    'GRAY'            'female'           88  
    'RAMIREZ'         'female'        95.25  
    'JAMES'           'male'          76.75  
    'WATSON'          'female'        91.75  
    'BROOKS'          'female'        85.75  
    'KELLY'           'male'          79.25

A new variable with average test scores is added to the dataset array, ds.

Reformat the dataset array into tall format.

Stack the test score variables into a new variable, Scores.

 dsTall = stack(ds,{'Test1','Test2','Test3','Test4'},...
            'newDataVarName','Scores')

dsTall = 

    LastName          Sex             TestAve    Scores_Indicator    Scores
    'HOWARD'          'male'           90.5      Test1               90    
    'HOWARD'          'male'           90.5      Test2               87    
    'HOWARD'          'male'           90.5      Test3               93    
    'HOWARD'          'male'           90.5      Test4               92    
    'WARD'            'male'          86.25      Test1               87    
    'WARD'            'male'          86.25      Test2               85    
    'WARD'            'male'          86.25      Test3               83    
    'WARD'            'male'          86.25      Test4               90    
    'TORRES'          'male'          86.25      Test1               86    
    'TORRES'          'male'          86.25      Test2               85    
    'TORRES'          'male'          86.25      Test3               88    
    'TORRES'          'male'          86.25      Test4               86    
    'PETERSON'        'female'           76      Test1               75    
    'PETERSON'        'female'           76      Test2               80    
    'PETERSON'        'female'           76      Test3               72    
    'PETERSON'        'female'           76      Test4               77    
    'GRAY'            'female'           88      Test1               89    
    'GRAY'            'female'           88      Test2               86    
    'GRAY'            'female'           88      Test3               87    
    'GRAY'            'female'           88      Test4               90    
    'RAMIREZ'         'female'        95.25      Test1               96    
    'RAMIREZ'         'female'        95.25      Test2               92    
    'RAMIREZ'         'female'        95.25      Test3               98    
    'RAMIREZ'         'female'        95.25      Test4               95    
    'JAMES'           'male'          76.75      Test1               78    
    'JAMES'           'male'          76.75      Test2               75    
    'JAMES'           'male'          76.75      Test3               77    
    'JAMES'           'male'          76.75      Test4               77    
    'WATSON'          'female'        91.75      Test1               91    
    'WATSON'          'female'        91.75      Test2               94    
    'WATSON'          'female'        91.75      Test3               92    
    'WATSON'          'female'        91.75      Test4               90    
    'BROOKS'          'female'        85.75      Test1               86    
    'BROOKS'          'female'        85.75      Test2               83    
    'BROOKS'          'female'        85.75      Test3               85    
    'BROOKS'          'female'        85.75      Test4               89    
    'KELLY'           'male'          79.25      Test1               79    
    'KELLY'           'male'          79.25      Test2               76    
    'KELLY'           'male'          79.25      Test3               82    
    'KELLY'           'male'          79.25      Test4               80

The original test variable names, Test1, Test2, Test3, and Test4, appear as levels in the combined test scores indicator variable, Scores_Indicator.

Plot data grouped by category.

With the data in this format, you can use Scores_Indicator as a grouping variable, and draw box plots of test scores grouped by test.

figure()
boxplot(dsTall.Scores,dsTall.Scores_Indicator)

Reformat the dataset array into wide format.

Reformat dsTall back into its original wide format.

dsWide = unstack(dsTall,'Scores','Scores_Indicator');
dsWide(:,{'LastName','Test1','Test2','Test3','Test4'})

ans = 

    LastName          Test1    Test2    Test3    Test4
    'HOWARD'          90       87       93       92   
    'WARD'            87       85       83       90   
    'TORRES'          86       85       88       86   
    'PETERSON'        75       80       72       77   
    'GRAY'            89       86       87       90   
    'RAMIREZ'         96       92       98       95   
    'JAMES'           78       75       77       77   
    'WATSON'          91       94       92       90   
    'BROOKS'          86       83       85       89   
    'KELLY'           79       76       82       80

The dataset array is back in wide format. unstack reassigns the levels of the indicator variable, Scores_Indicator, as variable names in the unstacked dataset array.

Load sample data.

Navigate to the folder containing sample data. Import the data from the spreadsheet messy.xlsx.

cd(matlabroot)
cd('help/toolbox/stats/examples')
messyData = dataset('XLSFile','messy.xlsx')

messyData = 

    var1          var2          var3         var4          var5
    'afe1'        '3'           'yes'        '3'              3
    'egh3'        '.'           'no'         '7'              7
    'wth4'        '3'           'yes'        '3'              3
    'atn2'        '23'          'no'         '23'            23
    'arg1'        '5'           'yes'        '5'              5
    'jre3'        '34.6'        'yes'        '34.6'        34.6
    'wen9'        '234'         'yes'        '234'          234
    'ple2'        '2'           'no'         '2'              2
    'dbo8'        '5'           'no'         '5'              5
    'oii4'        '5'           'yes'        '5'              5
    'wnk3'        '245'         'yes'        '245'          245
    'abk6'        '563'         ''           '563'          563
    'pnj5'        '463'         'no'         '463'          463
    'wnn3'        '6'           'no'         '6'              6
    'oks9'        '23'          'yes'        '23'            23
    'wba3'        ''            'yes'        'NaN'           14
    'pkn4'        '2'           'no'         '2'              2
    'adw3'        '22'          'no'         '22'            22
    'poj2'        '-99'         'yes'        '-99'          -99
    'bas8'        '23'          'no'         '23'            23
    'gry5'        'NA'          'yes'        'NaN'           21

When you import data from a spreadsheet, dataset reads any variables with nonnumeric elements as a cell array of strings. This is why the variable var2 is a cell array of strings. When importing data from a text file, you have more flexibility to specify which nonnumeric expressions to treat as missing using the option TreatAsEmpty.
There are many different missing data indicators in messy.xlsx, such as:

Empty cells
A period (.)
NA
NaN
-99

Find observations with missing values.

Display the subset of observations that have at least one missing value using ismissing.

ix = ismissing(messyData,'NumericTreatAsMissing',-99,...
                 'StringTreatAsMissing',{'NaN','.','NA'});
messyData(any(ix,2),:)

ans = 

    var1          var2         var3         var4         var5
    'egh3'        '.'          'no'         '7'            7 
    'abk6'        '563'        ''           '563'        563 
    'wba3'        ''           'yes'        'NaN'         14 
    'poj2'        '-99'        'yes'        '-99'        -99 
    'gry5'        'NA'         'yes'        'NaN'         21

By default, ismissing recognizes the following missing value indicators:

NaN for numeric arrays
'' for string arrays
<undefined> for categorical arrays

Use the NumericTreatAsMissing and StringTreatAsMissing options to specify other values to treat as missing.

Convert string arrays to double arrays.

You can convert the string variables that should be numeric using str2double.

messyData.var2 = str2double(messyData.var2);
messyData.var4 = str2double(messyData.var4)

messyData = 

    var1          var2    var3         var4    var5
    'afe1'           3    'yes'           3       3
    'egh3'         NaN    'no'            7       7
    'wth4'           3    'yes'           3       3
    'atn2'          23    'no'           23      23
    'arg1'           5    'yes'           5       5
    'jre3'        34.6    'yes'        34.6    34.6
    'wen9'         234    'yes'         234     234
    'ple2'           2    'no'            2       2
    'dbo8'           5    'no'            5       5
    'oii4'           5    'yes'           5       5
    'wnk3'         245    'yes'         245     245
    'abk6'         563    ''            563     563
    'pnj5'         463    'no'          463     463
    'wnn3'           6    'no'            6       6
    'oks9'          23    'yes'          23      23
    'wba3'         NaN    'yes'         NaN      14
    'pkn4'           2    'no'            2       2
    'adw3'          22    'no'           22      22
    'poj2'         -99    'yes'         -99     -99
    'bas8'          23    'no'           23      23
    'gry5'         NaN    'yes'         NaN      21

Now, var2 and var4 are numeric arrays. During the conversion, str2double replaces the nonnumeric elements of the variables var2 and var4 with the value NaN. However, there are no changes to the numeric missing value indicator, -99.
When applying the same function to many dataset array variables, it can sometimes be more convenient to use datasetfun. For example, to convert both var2 and var4 to numeric arrays simultaneously, you can use:

messyData(:,[2,4]) = datasetfun(@str2double,messyData,'DataVars',[2,4],...
                     'DatasetOutput',true);

Replace missing value indicators.

Clean the data so that the missing values indicated by the code -99 have the standard MATLAB^® numeric missing value indicator, NaN.

messyData = replaceWithMissing(messyData,'NumericValues',-99)

messyData = 

    var1          var2    var3         var4    var5
    'afe1'           3    'yes'           3       3
    'egh3'         NaN    'no'            7       7
    'wth4'           3    'yes'           3       3
    'atn2'          23    'no'           23      23
    'arg1'           5    'yes'           5       5
    'jre3'        34.6    'yes'        34.6    34.6
    'wen9'         234    'yes'         234     234
    'ple2'           2    'no'            2       2
    'dbo8'           5    'no'            5       5
    'oii4'           5    'yes'           5       5
    'wnk3'         245    'yes'         245     245
    'abk6'         563    ''            563     563
    'pnj5'         463    'no'          463     463
    'wnn3'           6    'no'            6       6
    'oks9'          23    'yes'          23      23
    'wba3'         NaN    'yes'         NaN      14
    'pkn4'           2    'no'            2       2
    'adw3'          22    'no'           22      22
    'poj2'         NaN    'yes'         NaN     NaN
    'bas8'          23    'no'           23      23
    'gry5'         NaN    'yes'         NaN      21

Create a dataset array with complete observations.

Create a new dataset array that contains only the complete observations—those without missing data.

ix = ismissing(messyData);
completeData = messyData(~any(ix,2),:)

completeData = 

    var1          var2    var3         var4    var5
    'afe1'           3    'yes'           3       3
    'wth4'           3    'yes'           3       3
    'atn2'          23    'no'           23      23
    'arg1'           5    'yes'           5       5
    'jre3'        34.6    'yes'        34.6    34.6
    'wen9'         234    'yes'         234     234
    'ple2'           2    'no'            2       2
    'dbo8'           5    'no'            5       5
    'oii4'           5    'yes'           5       5
    'wnk3'         245    'yes'         245     245
    'pnj5'         463    'no'          463     463
    'wnn3'           6    'no'            6       6
    'oks9'          23    'yes'          23      23
    'pkn4'           2    'no'            2       2
    'adw3'          22    'no'           22      22
    'bas8'          23    'no'           23      23

Load sample data.

Navigate to the folder containing sample data. Import the data from the comma-separated text file testScores.csv.
cd(matlabroot) cd('help/toolbox/stats/examples') ds = dataset('File','testScores.csv','Delimiter',',')ds = LastName Sex Test1 Test2 Test3 Test4 'HOWARD' 'male' 90 87 93 92 'WARD' 'male' 87 85 83 90 'TORRES' 'male' 86 85 88 86 'PETERSON' 'female' 75 80 72 77 'GRAY' 'female' 89 86 87 90 'RAMIREZ' 'female' 96 92 98 95 'JAMES' 'male' 78 75 77 77 'WATSON' 'female' 91 94 92 90 'BROOKS' 'female' 86 83 85 89 'KELLY' 'male' 79 76 82 80 There are 4 test scores for each of 10 students, in wide format.

Average dataset array variables.

Compute the average (mean) test score for each student in the dataset array, and store it in a new variable, TestAvg. Test scores are in columns 3 to 6.
Use double to convert the specified dataset array variables into a numeric array. Then, calculate the mean across the second dimension (across columns) to get the test average for each student.
ds.TestAvg = mean(double(ds(:,3:6)),2); ds(:,{'LastName','TestAvg'})ans = LastName TestAvg 'HOWARD' 90.5 'WARD' 86.25 'TORRES' 86.25 'PETERSON' 76 'GRAY' 88 'RAMIREZ' 95.25 'JAMES' 76.75 'WATSON' 91.75 'BROOKS' 85.75 'KELLY' 79.25

Summarize the dataset array using a grouping variable.

Compute the mean and maximum average test scores for each gender.
stats = grpstats(ds,'Sex',{'mean','max'},'DataVars','TestAvg')stats = Sex GroupCount mean_TestAvg max_TestAvg male 'male' 5 83.8 90.5 female 'female' 5 87.35 95.25 This returns a new dataset array containing the specified summary statistics for each level of the grouping variable, Sex.

Replace data values.

The denominator for each test score is 100. Convert the test score denominator to 25.
scores = double(ds(:,3:6)); newScores = scores*25/100; ds = replacedata(ds,newScores,3:6)ds = LastName Sex Test1 Test2 Test3 Test4 TestAvg 'HOWARD' 'male' 22.5 21.75 23.25 23 90.5 'WARD' 'male' 21.75 21.25 20.75 22.5 86.25 'TORRES' 'male' 21.5 21.25 22 21.5 86.25 'PETERSON' 'female' 18.75 20 18 19.25 76 'GRAY' 'female' 22.25 21.5 21.75 22.5 88 'RAMIREZ' 'female' 24 23 24.5 23.75 95.25 'JAMES' 'male' 19.5 18.75 19.25 19.25 76.75 'WATSON' 'female' 22.75 23.5 23 22.5 91.75 'BROOKS' 'female' 21.5 20.75 21.25 22.25 85.75 'KELLY' 'male' 19.75 19 20.5 20 79.25 The first two lines of code extract the test data and perform the desired calculation. Then, replacedata inserts the new test scores back into the dataset array.
The variable of test score averages, TestAvg, is now the final score for each student.

Change variable name.

Change the variable name to Final.
ds.Properties.VarNames{end} = 'Final'; dsds = LastName Sex Test1 Test2 Test3 Test4 Final 'HOWARD' 'male' 22.5 21.75 23.25 23 90.5 'WARD' 'male' 21.75 21.25 20.75 22.5 86.25 'TORRES' 'male' 21.5 21.25 22 21.5 86.25 'PETERSON' 'female' 18.75 20 18 19.25 76 'GRAY' 'female' 22.25 21.5 21.75 22.5 88 'RAMIREZ' 'female' 24 23 24.5 23.75 95.25 'JAMES' 'male' 19.5 18.75 19.25 19.25 76.75 'WATSON' 'female' 22.75 23.5 23 22.5 91.75 'BROOKS' 'female' 21.5 20.75 21.25 22.25 85.75 'KELLY' 'male' 19.75 19 20.5 20 79.25

Load sample data.

load('hospital')The dataset array has 100 observations and 7 variables.

Export to a text file.

Export the dataset array, hospital, to a text file named hospital.txt. By default, export writes to a tab-delimited text file with the same name as the dataset array, appended by .txt.
export(hospital)This creates the file hospital.txt in the current working folder, if it does not previously exist. If the file already exists in the current working folder, export overwrites the existing file.
By default, variable names are in the first line of the text file. Observation names, if present, are in the first column.

Export without variable names.

Export hospital with variable names suppressed to a text file named NoLabels.txt.
export(hospital,'File','NoLabels.txt','WriteVarNames',false)There are no variable names in the first line of the created text file, NoLabels.txt.

Export to a comma-delimited format.

Export hospital to a comma-delimited text file, hospital.csv.
export(hospital,'File','hospital.csv','Delimiter',',')

Export to an Excel spreadsheet.

Export hospital to an Excel^® spreadsheet named hospital.xlsx.
export(hospital,'XLSFile','hospital.xlsx')By default, the first row of hospital.xlsx has variable names, and the first column has observation names.

Open Dataset Arrays in the Variables Editor

The MATLAB Variables editor provides a convenient interface for viewing, modifying, and plotting dataset arrays.
First, load the sample data set, hospital.

load hospital

The dataset array, hospital, is created in the MATLAB workspace.

The dataset array has 100 observations and 7 variables.
To open hospital in the Variables editor, click Open Variable, and select hospital.

The Variables editor opens, displaying the contents of the dataset array (only the first 10 observations are shown here).

In the Variables editor, you can see the names of the seven variables along the top row, and the observations names down the first column.

Modify Variable and Observation Names

You can modify variable and observation names by double-clicking a name, and then typing new text.

All changes made in the Variables editor are also sent to the command line.

The sixth variable in the data set, BloodPressure, is a numeric array with two columns. The first column shows systolic blood pressure, and the second column shows diastolic blood pressure. Click the arrow that appears on the right side of the variable name cell to see the units and description of the variable. You can type directly in the units and description fields to modify the text. The variable data type and size are shown under the variable description.

Reorder or Delete Variables

You can reorder variables in a dataset array using the Variables editor. Hover over the left side of a variable name cell until a four-headed arrow appears.

After the arrow appears, click and drag the variable column to a new location.

The command for the variable reordering appears in the command line.

You can delete a variable in the Variables editor by selecting the variable column, right-clicking, and selecting Delete Column Variable(s).

The command for the variable deletion appears in the command line.

Add New Data

You can enter new data values directly into the Variables editor. For example, you can add a new patient observation to the hospital data set. To enter a new last name, add a string to the end of the variable LastName.

The variable Gender is a nominal array. The levels of the categorical variable appear in a drop-down list when you double-click a cell in the Gender column. You can choose one of the levels previously used, or create a new level by selecting New Item.

You can continue to add data for the remaining variables.
To change the observation name, click the observation name and type the new name.

The commands for entering the new data appear at the command line.

Notice the warning that appears after the first assignment. When you enter the first piece of data in the new observation row—here, the last name—default values are assigned to all other variables. Default assignments are:

0 for numeric variables
<undefined> for categorical variables
[] for cell arrays

You can also copy and paste data from one dataset array to another using the Variables editor.

Sort Observations

You can use the Variables editor to sort dataset array observations by the values of one or more variables. To sort by gender, for example, select the variable Gender. Then click Sort, and choose to sort rows by ascending or descending values of the selected variable.

When sorting by variables that are cell arrays of strings or of nominal data type, observations are sorted alphabetically. For ordinal variables, rows are sorted by the ordering of the levels. For example, when the observations of hospital are sorted by the values in Gender, the females are grouped together, followed by the males.

To sort by the values of multiple variables, press Ctrl while you select multiple variables.
When you use the Variables editor to sort rows, it is the same as calling sortrows. You can see this at the command line after executing the sorting.

Select a Subset of Data

You can select a subset of data from a dataset array in the Variables editor, and create a new dataset array from the selection. For example, to create a dataset array containing only the variables LastName and Age:

Hold Ctrl while you click the variables LastName and Age.
Right-click, and select New Workspace Variable from Selection > New Dataset Array.

The new dataset array appears in the Workspace window with the name hospital1. The Command Window shows the commands that execute the selection.

You can use the same steps to select any subset of data. To select observations according to some logical condition, you can use a combination of sorting and selecting. For example, to create a new dataset array containing only males aged 45 and older:

Sort the observations of hospital by the values in Gender and Age, descending.
Select the male observations with age 45 and older.
Right-click, and select New Workspace Variables from Selection > New Dataset Array. The new dataset array, hospital2, is created in the Workspace window.
You can rename the dataset array in the Workspace window.

Create Plots

You can plot data from a dataset array using plotting options in the Variables editor. Available plot choices depend on the data types of variables to be plotted.
For example, if you select the variable Age, you can see in the Plots tab some plotting options that are appropriate for a univariate, numeric variable.

Sometimes, there are plot options for multiple variables, depending on their data types. For example, if you select both Age and Gender, you can draw box plots of age, grouped by gender.

Ways To Index and Search

There are many ways to index into dataset arrays. For example, for a dataset array, ds, you can:

Use () to create a new dataset array from a subset of ds. For example, ds1 = ds(1:5,:) creates a new dataset array, ds1, consisting of the first five rows of ds. Metadata, including variable and observation names, transfers to the new dataset array.
Use variable names with dot notation to index individual variables in a dataset array. For example, ds.Height indexes the variable named Height.
Use observation names to index individual observations in a dataset array. For example, ds('Obs1',:) gives data for the observation named Obs1.
Use observation or variable numbers. For example, ds(:,[1,3,5]) gives the data in the first, third, and fifth variables (columns) of ds.
Use logical indexing to search for observations in ds that satisfy a logical condition. For example, ds(ds.Gender=='Male',:) gives the observations in ds where the variable named Gender, a nominal array, has the value Male.
Use ismissing to find missing data in the dataset array.

Examples

Common Indexing and Searching Methods

This example shows several indexing and searching methods for categorical arrays.
Load the sample data.

load hospital;
size(hospital)

ans =

   100     7

The dataset array has 100 observations and 7 variables.
Index a variable by name. Return the minimum age in the dataset array.

min(hospital.Age)

ans =

    25

Delete the variable Trials.

hospital.Trials = [];
size(hospital)

ans =

   100     6

Index an observation by name. Display measurements on the first five variables for the observation named PUE-347.

hospital('PUE-347',1:5)

ans = 

               LastName       Sex       Age    Weight    Smoker
    PUE-347    'YOUNG'        Female    25     114       false

Index variables by number. Create a new dataset array containing the first four variables of hospital.

dsNew = hospital(:,1:4);
dsNew.Properties.VarNames(:)

ans = 

    'LastName'
    'Sex'
    'Age'
    'Weight'

Index observations by number. Delete the last 10 observations.

hospital(end-9:end,:) = [];
size(hospital)

ans =

    90     6

Search for observations by logical condition. Create a new dataset array containing only females who smoke.

dsFS = hospital(hospital.Sex=='Female' & hospital.Smoker==true,:);
dsFS(:,{'LastName','Sex','Smoker'})

ans = 

               LastName           Sex       Smoker
    LPD-746    'MILLER'           Female    true  
    XBR-291    'GARCIA'           Female    true  
    AAX-056    'LEE'              Female    true  
    DTT-578    'WALKER'           Female    true  
    AFK-336    'WRIGHT'           Female    true  
    RBA-579    'SANCHEZ'          Female    true  
    HAK-381    'MORRIS'           Female    true  
    NSK-403    'RAMIREZ'          Female    true  
    ILS-109    'WATSON'           Female    true  
    JDR-456    'SANDERS'          Female    true  
    HWZ-321    'PATTERSON'        Female    true  
    GGU-691    'HUGHES'           Female    true  
    WUS-105    'FLORES'           Female    true

订阅：博文 (Atom)

Ran's Tough Ph.D Road

热门帖子

2015年9月21日星期一

Decision Tree using R -- 用R 实现决策树

Tree-Based Models

CART Modeling via rpart

1. Grow the Tree

2. Examine the results

3. prune tree

Classification Tree example

Regression Tree example

Conditional inference trees via party

Random Forests

2015年3月16日星期一

Latex 科技论文写作心得

2014年11月18日星期二

Matlab中dataset的有关运算

Create a Dataset Array from a Numeric Array

Create Dataset Array from Heterogeneous Workspace Variables

Create a Dataset Array from a Tab-Delimited Text File

Create a Dataset Array from a Comma-Separated Text File

Create a Dataset Array from an Excel File

Open Dataset Arrays in the Variables Editor

Modify Variable and Observation Names

Reorder or Delete Variables

Add New Data

Sort Observations

Select a Subset of Data

Create Plots

Ways To Index and Search

Examples

Common Indexing and Searching Methods