This example shows how to create a dataset
array from heterogeneous variables existing in the MATLAB workspace.
Create a dataset array from a subset of the workspace variables.
When creating the dataset array, you do not need to enter variable
names.
automatically uses the name of each
workspace variable.
, contains
a collection of variables with heterogeneous data types.
is
a character array, and the other variables are numeric.
Display the first five observations in the dataset array.
.
.
for
the number of cylinders. Convert
.
Display the country of origin and number of cylinders for the
first 15 cars.
This example shows how to create a dataset
array from the contents of a tab-delimited text file.
Navigate to the folder containing sample data. Import the text
file
as a dataset array using
the default settings.
uses the first row of
the text file for variable names. If the first row does not contain
variable names, you can specify the optional name-value pair argument
to
change the default behavior.
The dataset array contains heterogeneous variables. The variables
are cell arrays of strings, and the other
variables are numeric.
You can see the data type and other descriptive statistics for
each variable by using
to summarize the
dataset array.
Import the text file again, this time specifying that the first
column contains observation names.
The elements of the first column in the text file, last names,
are now observation names. Observation names and row names are dataset
array properties. You can always add or change the observation names
of an existing dataset array by modifying the property
.
as the descriptor of the
observation (row) dimension.
got this name
from the first row of the first column in the text file.
.
You can use observation names to index into a dataset array.
For example, return the data for the patient with last name
.
Note that observation names must be unique.
Create a Dataset Array from an Excel File
This example shows how to create a dataset
array from the contents of an Excel
® spreadsheet file.
Create a dataset array.
Navigate to the folder containing sample data. Import the data
from the first worksheet in the file
hospitalSmall.xlsx,
specifying that the data file is an Excel spreadsheet.
cd(matlabroot)
cd('help/toolbox/stats/examples')
ds = dataset('XLSFile','hospitalSmall.xlsx')
ds =
id name sex age wgt smoke
'YPL-320' 'SMITH' 'm' 38 176 1
'GLI-532' 'JOHNSON' 'm' 43 163 0
'PNI-258' 'WILLIAMS' 'f' 38 131 0
'MIJ-579' 'JONES' 'f' 40 133 0
'XLK-030' 'BROWN' 'f' 49 119 0
'TFP-518' 'DAVIS' 'f' 46 142 0
'LPD-746' 'MILLER' 'f' 33 142 1
'ATA-945' 'WILSON' 'm' 40 180 0
'VNL-702' 'MOORE' 'm' 28 183 0
'LQW-768' 'TAYLOR' 'f' 31 132 0
'QFY-472' 'ANDERSON' 'f' 45 128 0
'UJG-627' 'THOMAS' 'f' 42 137 0
'XUE-826' 'JACKSON' 'm' 25 174 0
'TRW-072' 'WHITE' 'm' 39 202 1
By default,
dataset creates variable names
using the contents of the first row in the spreadsheet.
Specify which worksheet to import.
Import the data from the second worksheet into a new dataset
array.
ds2 = dataset('XLSFile','hospitalSmall.xlsx','Sheet',2)
ds2 =
id name sex age wgt smoke
'TRW-072' 'WHITE' 'm' 39 202 1
'ELG-976' 'HARRIS' 'f' 36 129 0
'KOQ-996' 'MARTIN' 'm' 48 181 1
'YUZ-646' 'THOMPSON' 'm' 32 191 1
'XBR-291' 'GARCIA' 'f' 27 131 1
'KPW-846' 'MARTINEZ' 'm' 37 179 0
'XBA-581' 'ROBINSON' 'm' 50 172 0
'BKD-785' 'CLARK' 'f' 48 133 0
Load sample data.
Navigate to the folder containing sample data. Import the data
from the first worksheet in
hospitalSmall.xlsx into
a dataset array.
cd(matlabroot)
cd(
'help/toolbox/stats/examples')
ds = dataset(
'XLSFile',
'hospitalSmall.xlsx');
size(ds)ans =
14 6The dataset array,
ds, has 14 observations
(rows) and 6 variables (columns).
Add observations by concatenation.
The second worksheet in
hospitalSmall.xlsx has
additional patient data. Append the observations in this spreadsheet
to the end of
ds.
ds2 = dataset(
'XLSFile',
'hospitalSmall.xlsx',
'Sheet',2);
dsNew = [ds;ds2];
size(dsNew)ans =
22 6The dataset array
dsNew has 22 observations.
In order to vertically concatenate two dataset arrays, both arrays
must have the same number of variables, with the same variable names.
Add observations from a cell array.
If you want to append new observations stored in a cell array,
first convert the cell array to a dataset array, and then concatenate
the dataset arrays.
cellObs = {
'id',
'name',
'sex',
'age',
'wgt',
'smoke';
'YQR-965',
'BAKER',
'M',36,160,0;
'LFG-497',
'WALL' ,
'F',28,125,1;
'KSD-003',
'REED' ,
'M',32,187,0};
dsNew = [dsNew;cell2dataset(cellObs)];
size(dsNew)ans =
25 6
Add observations from a structure.
You can also append new observations stored in a structure.
Convert the structure to a dataset array, and then concatenate the
dataset arrays.
structObs(1,1).id =
'GHK-842';
structObs(1,1).name =
'GEORGE';
structObs(1,1).sex =
'M';
structObs(1,1).age = 45;
structObs(1,1).wgt = 182;
structObs(1,1).smoke = 1;
structObs(2,1).id =
'QRH-308';
structObs(2,1).name =
'BAILEY';
structObs(2,1).sex =
'F';
structObs(2,1).age = 29;
structObs(2,1).wgt = 120;
structObs(2,1).smoke = 0;
dsNew = [dsNew;struct2dataset(structObs)];
size(dsNew)ans =
27 6
Delete duplicate observations.
Use
unique to delete any observations in
a dataset array that are duplicated.
dsNew = unique(dsNew);
size(dsNew)ans =
21 6
One duplicated observation is deleted.
Delete observations by observation number.
Delete observations 18, 20, and 21 from the dataset array.
dsNew([18,20,21],:) = [];
size(dsNew)ans =
18 6The dataset array has only 18 observations now.
Delete observations by observation name.
First, specify the variable of identifiers,
id,
as observation names. Then, delete the variable
id from
dsNew.
You can use the observation name to index observations.
dsNew.Properties.ObsNames = dsNew.id;
dsNew.id = [];
dsNew(
'KOQ-996',:) = [];
size(dsNew)ans =
17 5The dataset array now has one less observation and one less
variable.
Search for observations to delete.
You can also search for observations in the dataset array. For
example, delete observations for any patients with the last name
WILLIAMS.
toDelete = strcmp(dsNew.name,
'WILLIAMS');
dsNew(toDelete,:) = [];
size(dsNew)ans =
16 5The dataset array now has one less observation.
Load sample data.
Navigate to the folder containing sample data.
cd(matlabroot)
cd('help/toolbox/stats/examples')
Import the data from the first worksheet in
hospitalSmall.xlsx into
a dataset array.
ds = dataset('XLSFile','hospitalSmall.xlsx');
size(ds)
ans =
14 6
The dataset array,
ds, has 14 observations
(rows) and 6 variables (columns).
Add variables by concatenating dataset arrays.
The worksheet
Heights in
hospitalSmall.xlsx has
heights for the patients on the first worksheet. Concatenate the data
in this spreadsheet with
ds.
ds2 = dataset('XLSFile','hospitalSmall.xlsx','Sheet','Heights');
ds = [ds ds2];
size(ds)
ans =
14 7
The dataset array now has seven variables. You can only horizontally
concatenate dataset arrays with observations in the same position,
or with the same observation names.
ds.Properties.VarNames{end}
ans =
hgt
The name of the last variable in
ds is
hgt,
which
dataset read from the first row of the imported
spreadsheet.
Delete variables by variable name.
First, specify the unique identifiers in the variable
id as
observation names. Then, delete the variable
id from
the dataset array.
ds.Properties.ObsNames = ds.id;
ds.id = [];
size(ds)
ans =
14 6
The dataset array now has six variables. List the variable names.
ds.Properties.VarNames(:)
ans =
'name'
'sex'
'age'
'wgt'
'smoke'
'hgt'
There is no longer a variable called
id.
Add a new variable by name.
Add a new variable,
bmi—which contains
the body mass index (BMI) for each patient—to the dataset array.
BMI is a function of height and weight. Display the last name, gender,
and BMI for each patient.
ds.bmi = ds.wgt*703./ds.hgt.^2;
ds(:,{'name','sex','bmi'})
ans =
name sex bmi
YPL-320 'SMITH' 'm' 24.544
GLI-532 'JOHNSON' 'm' 24.068
PNI-258 'WILLIAMS' 'f' 23.958
MIJ-579 'JONES' 'f' 25.127
XLK-030 'BROWN' 'f' 21.078
TFP-518 'DAVIS' 'f' 27.729
LPD-746 'MILLER' 'f' 26.828
ATA-945 'WILSON' 'm' 24.41
VNL-702 'MOORE' 'm' 27.822
LQW-768 'TAYLOR' 'f' 22.655
QFY-472 'ANDERSON' 'f' 23.409
UJG-627 'THOMAS' 'f' 25.883
XUE-826 'JACKSON' 'm' 24.265
TRW-072 'WHITE' 'm' 29.827
The operators
./ and
.^ in
the calculation of BMI indicate element-wise division and exponentiation,
respectively.
Delete variables by variable number.
Delete the variable
wgt, the fourth variable
in the dataset array.
ds(:,4) = [];
ds.Properties.VarNames(:)
ans =
'name'
'sex'
'age'
'smoke'
'hgt'
'bmi'
The variable
wgt is deleted from the dataset
array.
Access variables by name.
You can access variable data, or select a subset of variables,
by using variable (column) names and dot indexing. Load a sample dataset
array. Display the names of the variables in
hospital.
load('hospital')
hospital.Properties.VarNames(:)
ans =
'LastName'
'Sex'
'Age'
'Weight'
'Smoker'
'BloodPressure'
'Trials'
The dataset array has 7 variables (columns) and 100 observations
(rows). You can double-click
hospital in the Workspace
window to view the dataset array in the Variables editor.
Plot histogram.
Plot a histogram of the data in the variable
Weight.
figure()
hist(hospital.Weight)
The histogram shows that the weight distribution is bimodal.
Plot data grouped by category.
Draw box plots of
Weight grouped by the values
in
Sex (Male and Female). That is, use the variable
Sex as
a grouping variable.
figure()
boxplot(hospital.Weight,hospital.Sex)
The box plot suggests that gender accounts for the bimodality
in weight.
Select a subset of variables.
Create a new dataset array with only the variables
LastName,
Sex,
and
Weight. You can access the variables by name
or column number.
ds1 = hospital(:,{'LastName','Sex','Weight'});
ds2 = hospital(:,[1,2,4]);
The dataset arrays
ds1 and
ds2 are
equivalent. Use parentheses
( ) when indexing dataset
arrays to preserve the data type; that is, to create a dataset array
from a subset of a dataset array. You can also use the Variables editor
to create a new dataset array from a subset of variables and observations.
Convert the variable data type.
Convert the data type of the variable
Smoker from
logical to nominal with labels
No and
Yes.
hospital.Smoker = nominal(hospital.Smoker,{'No','Yes'});
class(hospital.Smoker)
ans =
nominal
Explore data.
Display the first 10 elements of
Smoker.
hospital.Smoker(1:10)
ans =
Yes
No
No
No
No
No
Yes
No
No
No
If you want to change the level labels in a nominal array, use
setlabels.
Add variables.
The variable
BloodPressure is a 100-by-2
array. The first column corresponds to systolic blood pressure, and
the second column to diastolic blood pressure. Separate this array
into two new variables,
SysPressure and
DiaPressure.
hospital.SysPressure = hospital.BloodPressure(:,1);
hospital.DiaPressure = hospital.BloodPressure(:,2);
hospital.Properties.VarNames(:)
ans =
'LastName'
'Sex'
'Age'
'Weight'
'Smoker'
'BloodPressure'
'Trials'
'SysPressure'
'DiaPressure'
The dataset array,
hospital, has two new
variables.
Search for variables by name.
Use
regexp to find variables in
hospital with
the string
'Pressure' in their name. Create a new
dataset array containing only these variables.
bp = regexp(hospital.Properties.VarNames,'Pressure');
bpIdx = cellfun(@isempty,bp);
bpData = hospital(:,~bpIdx);
bpData.Properties.VarNames(:)
ans =
'BloodPressure'
'SysPressure'
'DiaPressure'
The new dataset array,
bpData, contains only
the blood pressure variables.
Delete variables.
Delete the variable
BloodPressure from the
dataset array,
hospital.
hospital.BloodPressure = [];
hospital.Properties.VarNames(:)
ans =
'LastName'
'Sex'
'Age'
'Weight'
'Smoker'
'Trials'
'SysPressure'
'DiaPressure'
The variable
BloodPressure is no longer in
the dataset array.
Load sample data.
Load the sample dataset array,
hospital.
Dataset arrays can have observation (row) names. This array has observation
names corresponding to unique patient identifiers.
load('hospital')
hospital.Properties.ObsNames(1:10)
ans =
'YPL-320'
'GLI-532'
'PNI-258'
'MIJ-579'
'XLK-030'
'TFP-518'
'LPD-746'
'ATA-945'
'VNL-702'
'LQW-768'
These are the first 10 observation names.
Index an observation by name.
You can use the observation names to index into the dataset
array. For example, extract the last name, sex, and age for the patient
with identifier
XLK-030.
hospital('XLK-030',{'LastName','Sex','Age'})
ans =
LastName Sex Age
XLK-030 'BROWN' Female 49
Index a subset of observations by number.
Create a new dataset array containing the first 50 patients.
ds50 = hospital(1:50,:);
size(ds50)
ans =
50 7
Search observations using a logical condition.
Create a new dataset array containing only male patients. To
find the male patients, use a logical condition to search the variable
containing gender information.
dsMale = hospital(hospital.Sex=='Male',:);
dsMale(1:10,{'LastName','Sex'})
ans =
LastName Sex
YPL-320 'SMITH' Male
GLI-532 'JOHNSON' Male
ATA-945 'WILSON' Male
VNL-702 'MOORE' Male
XUE-826 'JACKSON' Male
TRW-072 'WHITE' Male
KOQ-996 'MARTIN' Male
YUZ-646 'THOMPSON' Male
KPW-846 'MARTINEZ' Male
XBA-581 'ROBINSON' Male
Search observations using multiple conditions.
You can use multiple conditions to search the dataset array.
For example, create a new dataset array containing only female patients
older than 40.
dsFemale = hospital(hospital.Sex=='Female' & hospital.Age > 40,:);
dsFemale(1:10,{'LastName','Sex','Age'})
ans =
LastName Sex Age
XLK-030 'BROWN' Female 49
TFP-518 'DAVIS' Female 46
QFY-472 'ANDERSON' Female 45
UJG-627 'THOMAS' Female 42
BKD-785 'CLARK' Female 48
VWL-936 'LEWIS' Female 41
AAX-056 'LEE' Female 44
AFK-336 'WRIGHT' Female 45
KKL-155 'ADAMS' Female 48
RBA-579 'SANCHEZ' Female 44
Select a random subset of observations.
Create a new dataset array containing a random subset of 20
patients from the dataset array
hospital.
rng('default') % For reproducibility
dsRandom = hospital(randsample(length(hospital),20),:);
dsRandom.Properties.ObsNames
ans =
'DAU-529'
'AGR-528'
'RBO-332'
'QOO-305'
'RVS-253'
'QEQ-082'
'EHE-616'
'HVR-372'
'KOQ-996'
'REV-997'
'PUE-347'
'LQW-768'
'YLN-495'
'HJQ-495'
'ELG-976'
'XUE-826'
'MEZ-469'
'UDS-151'
'MIJ-579'
'DGC-290'
Delete observations by name.
Delete the data for the patient with observation name
HVR-372.
hospital('HVR-372',:) = [];
size(hospital)
ans =
99 7
The dataset array has one less observation.
Sort observations in ascending order.
Load the sample dataset array,
hospital.
Sort the observations by the values in
Age, in
ascending order.
load('hospital')
dsAgeUp = sortrows(hospital,'Age');
dsAgeUp(1:10,{'LastName','Age'})
ans =
LastName Age
XUE-826 'JACKSON' 25
FZR-250 'HALL' 25
PUE-347 'YOUNG' 25
LIM-480 'HILL' 25
SCQ-914 'JAMES' 25
REV-997 'ALEXANDER' 25
XBR-291 'GARCIA' 27
VNL-702 'MOORE' 28
DTT-578 'WALKER' 28
XAX-646 'COOPER' 28
The youngest patients are age 25.
Sort observations in descending order.
Sort the observations by
Age in descending
order.
dsAgeDown = sortrows(hospital,'Age','descend');
dsAgeDown(1:10,{'LastName','Age'})
ans =
LastName Age
XBA-581 'ROBINSON' 50
DAU-529 'REED' 50
XLK-030 'BROWN' 49
FLJ-908 'STEWART' 49
GGU-691 'HUGHES' 49
MEZ-469 'GRIFFIN' 49
KOQ-996 'MARTIN' 48
BKD-785 'CLARK' 48
KKL-155 'ADAMS' 48
NSK-403 'RAMIREZ' 48
The oldest patients are age 50.
Sort observations by the values of two variables.
Sort the observations in
hospital by
Age,
and then by
LastName.
dsName = sortrows(hospital,{'Age','LastName'});
dsName(1:10,{'LastName','Age'})
ans =
LastName Age
REV-997 'ALEXANDER' 25
FZR-250 'HALL' 25
LIM-480 'HILL' 25
XUE-826 'JACKSON' 25
SCQ-914 'JAMES' 25
PUE-347 'YOUNG' 25
XBR-291 'GARCIA' 27
XAX-646 'COOPER' 28
QEQ-082 'COX' 28
NSU-424 'JENKINS' 28
Now the names are sorted alphabetically within increasing age
groups.
Sort observations in mixed order.
Sort the observations in
hospital by
Age in
an increasing order, and then by
Weight in a decreasing
order.
dsWeight = sortrows(hospital,{'Age','Weight'},{'ascend','descend'});
dsWeight(1:10,{'LastName','Age','Weight'})
ans =
LastName Age Weight
FZR-250 'HALL' 25 189
SCQ-914 'JAMES' 25 186
XUE-826 'JACKSON' 25 174
REV-997 'ALEXANDER' 25 171
LIM-480 'HILL' 25 138
PUE-347 'YOUNG' 25 114
XBR-291 'GARCIA' 27 131
NSU-424 'JENKINS' 28 189
VNL-702 'MOORE' 28 183
XAX-646 'COOPER' 28 127
This
shows that the maximum weight among patients that are age 25 is 189
lbs.
Sort observations by observation name.
Sort the observations in
hospital by the
observation names.
dsObs = sortrows(hospital,'obsnames');
dsObs(1:10,{'LastName','Age'})
ans =
LastName Age
AAX-056 'LEE' 44
AFB-271 'PEREZ' 44
AFK-336 'WRIGHT' 45
AGR-528 'SIMMONS' 45
ATA-945 'WILSON' 40
BEZ-311 'DIAZ' 45
BKD-785 'CLARK' 48
DAU-529 'REED' 50
DGC-290 'BUTLER' 38
DTT-578 'WALKER' 28
The observations are sorted by observation name in ascending
alphabetical order.
Load sample data.
Navigate to a folder containing sample data. Import the data
from the first worksheet in
hospitalSmall.xlsx into
a dataset array, then keep only a few of the variables.
cd(matlabroot)
cd('help/toolbox/stats/examples')
ds1 = dataset('XLSFile','hospitalSmall.xlsx');
ds1 = ds1(:,{'id','name','sex','age'})
ds1 =
id name sex age
'YPL-320' 'SMITH' 'm' 38
'GLI-532' 'JOHNSON' 'm' 43
'PNI-258' 'WILLIAMS' 'f' 38
'MIJ-579' 'JONES' 'f' 40
'XLK-030' 'BROWN' 'f' 49
'TFP-518' 'DAVIS' 'f' 46
'LPD-746' 'MILLER' 'f' 33
'ATA-945' 'WILSON' 'm' 40
'VNL-702' 'MOORE' 'm' 28
'LQW-768' 'TAYLOR' 'f' 31
'QFY-472' 'ANDERSON' 'f' 45
'UJG-627' 'THOMAS' 'f' 42
'XUE-826' 'JACKSON' 'm' 25
'TRW-072' 'WHITE' 'm' 39
The dataset array,
ds1, has 14 observations
(rows) and 4 variables (columns).
Import the data from the worksheet
Heights2 in
hospitalSmall.xlsx.
ds2 = dataset('XLSFile','hospitalSmall.xlsx','Sheet','Heights2')
ds2 =
id hgt
'LPD-746' 61
'PNI-258' 62
'XUE-826' 71
'ATA-945' 72
'XLK-030' 63
ds2 has height measurements for a subset
of five individuals from the first dataset array,
ds1.
Merge only the matching subset of observations.
Use
join to merge the two dataset arrays,
ds1 and
ds2,
keeping only the subset of observations that are in
ds2.
JoinSmall = join(ds2,ds1)
JoinSmall =
id hgt name sex age
'LPD-746' 61 'MILLER' 'f' 33
'PNI-258' 62 'WILLIAMS' 'f' 38
'XUE-826' 71 'JACKSON' 'm' 25
'ATA-945' 72 'WILSON' 'm' 40
'XLK-030' 63 'BROWN' 'f' 49
In
JoinSmall, the variable
id only
appears once. This is because it is the key variable—the variable
that links observations between the two dataset arrays—and
has the same variable name in both
ds1 and
ds2.
Include incomplete observations in the merge.
Merge
ds1 and
ds2 keeping
all observations in the larger
ds1.
joinAll = join(ds2,ds1,'type','rightouter','mergekeys',true)
joinAll =
id hgt name sex age
'ATA-945' 72 'WILSON' 'm' 40
'GLI-532' NaN 'JOHNSON' 'm' 43
'LPD-746' 61 'MILLER' 'f' 33
'LQW-768' NaN 'TAYLOR' 'f' 31
'MIJ-579' NaN 'JONES' 'f' 40
'PNI-258' 62 'WILLIAMS' 'f' 38
'QFY-472' NaN 'ANDERSON' 'f' 45
'TFP-518' NaN 'DAVIS' 'f' 46
'TRW-072' NaN 'WHITE' 'm' 39
'UJG-627' NaN 'THOMAS' 'f' 42
'VNL-702' NaN 'MOORE' 'm' 28
'XLK-030' 63 'BROWN' 'f' 49
'XUE-826' 71 'JACKSON' 'm' 25
'YPL-320' NaN 'SMITH' 'm' 38
Each
observation in
ds1 without corresponding height
measurements in
ds2 has height value
NaN.
Also, because there is no
id value in
ds2 for
each observation in
ds1, you need to merge the
keys using the option
'MergeKeys',true. This merges
the key variable,
id.
Merge dataset arrays with different key variable names.
When using
join, it is not necessary for
the key variable to have the same name in the dataset arrays to be
merged. Import the data from the worksheet named
Heights3 in
hospitalSmall.xlsx.
ds3 = dataset('XLSFile','hospitalSmall.xlsx','Sheet','Heights3')
ds3 =
identifier hgt
'GLI-532' 69
'QFY-472' 62
'MIJ-579' 61
'VNL-702' 68
'XLK-030' 63
'LPD-746' 61
'TFP-518' 60
'YPL-320' 71
'ATA-945' 72
'LQW-768' 64
'PNI-258' 62
'UJG-627' 61
'XUE-826' 71
'TRW-072' 69
ds3 has height measurements for each observation
in
ds1. This dataset array has the same patient
identifiers as
ds1, but they are under the variable
name
identifier, instead of
id (and
in a different order).
Specify key variable.
You can easily change the variable name of the key variable
in
ds3 by setting
d3.Properties.VarNames or
using the Variables editor, but it is not required to perform a merge.
Instead, you can specify the name of the key variable in each dataset
array using
LeftKeys and
RightKeys.
joinDiff = join(ds3,ds1,'LeftKeys','identifier','RightKeys','id')
joinDiff =
identifier hgt name sex age
'GLI-532' 69 'JOHNSON' 'm' 43
'QFY-472' 62 'ANDERSON' 'f' 45
'MIJ-579' 61 'JONES' 'f' 40
'VNL-702' 68 'MOORE' 'm' 28
'XLK-030' 63 'BROWN' 'f' 49
'LPD-746' 61 'MILLER' 'f' 33
'TFP-518' 60 'DAVIS' 'f' 46
'YPL-320' 71 'SMITH' 'm' 38
'ATA-945' 72 'WILSON' 'm' 40
'LQW-768' 64 'TAYLOR' 'f' 31
'PNI-258' 62 'WILLIAMS' 'f' 38
'UJG-627' 61 'THOMAS' 'f' 42
'XUE-826' 71 'JACKSON' 'm' 25
'TRW-072' 69 'WHITE' 'm' 39
The merged dataset array,
joinDiff, has the
same key variable order and name as the first dataset array input
to
join,
ds3.
Load sample data.
Navigate to the folder containing sample data. Import the data
from the comma-separated text file
testScores.csv.
cd(matlabroot)
cd('help/toolbox/stats/examples')
ds = dataset('File','testScores.csv','Delimiter',',')
ds =
LastName Sex Test1 Test2 Test3 Test4
'HOWARD' 'male' 90 87 93 92
'WARD' 'male' 87 85 83 90
'TORRES' 'male' 86 85 88 86
'PETERSON' 'female' 75 80 72 77
'GRAY' 'female' 89 86 87 90
'RAMIREZ' 'female' 96 92 98 95
'JAMES' 'male' 78 75 77 77
'WATSON' 'female' 91 94 92 90
'BROOKS' 'female' 86 83 85 89
'KELLY' 'male' 79 76 82 80
Each of the 10 students has 4 test scores, displayed here in
wide format.
Perform calculations on dataset array.
With the data in this format, you can, for example, calculate
the average test score for each student. The test scores are in columns
3 to 6.
ds.TestAve = mean(double(ds(:,3:6)),2);
ds(:,{'LastName','Sex','TestAve'})
ans =
LastName Sex TestAve
'HOWARD' 'male' 90.5
'WARD' 'male' 86.25
'TORRES' 'male' 86.25
'PETERSON' 'female' 76
'GRAY' 'female' 88
'RAMIREZ' 'female' 95.25
'JAMES' 'male' 76.75
'WATSON' 'female' 91.75
'BROOKS' 'female' 85.75
'KELLY' 'male' 79.25
A new variable with average test scores is added to the dataset
array,
ds.
Reformat the dataset array into tall format.
Stack the test score variables into a new variable,
Scores.
dsTall = stack(ds,{'Test1','Test2','Test3','Test4'},...
'newDataVarName','Scores')
dsTall =
LastName Sex TestAve Scores_Indicator Scores
'HOWARD' 'male' 90.5 Test1 90
'HOWARD' 'male' 90.5 Test2 87
'HOWARD' 'male' 90.5 Test3 93
'HOWARD' 'male' 90.5 Test4 92
'WARD' 'male' 86.25 Test1 87
'WARD' 'male' 86.25 Test2 85
'WARD' 'male' 86.25 Test3 83
'WARD' 'male' 86.25 Test4 90
'TORRES' 'male' 86.25 Test1 86
'TORRES' 'male' 86.25 Test2 85
'TORRES' 'male' 86.25 Test3 88
'TORRES' 'male' 86.25 Test4 86
'PETERSON' 'female' 76 Test1 75
'PETERSON' 'female' 76 Test2 80
'PETERSON' 'female' 76 Test3 72
'PETERSON' 'female' 76 Test4 77
'GRAY' 'female' 88 Test1 89
'GRAY' 'female' 88 Test2 86
'GRAY' 'female' 88 Test3 87
'GRAY' 'female' 88 Test4 90
'RAMIREZ' 'female' 95.25 Test1 96
'RAMIREZ' 'female' 95.25 Test2 92
'RAMIREZ' 'female' 95.25 Test3 98
'RAMIREZ' 'female' 95.25 Test4 95
'JAMES' 'male' 76.75 Test1 78
'JAMES' 'male' 76.75 Test2 75
'JAMES' 'male' 76.75 Test3 77
'JAMES' 'male' 76.75 Test4 77
'WATSON' 'female' 91.75 Test1 91
'WATSON' 'female' 91.75 Test2 94
'WATSON' 'female' 91.75 Test3 92
'WATSON' 'female' 91.75 Test4 90
'BROOKS' 'female' 85.75 Test1 86
'BROOKS' 'female' 85.75 Test2 83
'BROOKS' 'female' 85.75 Test3 85
'BROOKS' 'female' 85.75 Test4 89
'KELLY' 'male' 79.25 Test1 79
'KELLY' 'male' 79.25 Test2 76
'KELLY' 'male' 79.25 Test3 82
'KELLY' 'male' 79.25 Test4 80
The original test variable names,
Test1,
Test2,
Test3,
and
Test4, appear as levels in the combined test
scores indicator variable,
Scores_Indicator.
Plot data grouped by category.
With the data in this format, you can use
Scores_Indicator as
a grouping variable, and draw box plots of test scores grouped by
test.
figure()
boxplot(dsTall.Scores,dsTall.Scores_Indicator)
Reformat the dataset array into wide format.
Reformat
dsTall back into its original wide
format.
dsWide = unstack(dsTall,'Scores','Scores_Indicator');
dsWide(:,{'LastName','Test1','Test2','Test3','Test4'})
ans =
LastName Test1 Test2 Test3 Test4
'HOWARD' 90 87 93 92
'WARD' 87 85 83 90
'TORRES' 86 85 88 86
'PETERSON' 75 80 72 77
'GRAY' 89 86 87 90
'RAMIREZ' 96 92 98 95
'JAMES' 78 75 77 77
'WATSON' 91 94 92 90
'BROOKS' 86 83 85 89
'KELLY' 79 76 82 80
The dataset array is back in wide format.
unstack reassigns
the levels of the indicator variable,
Scores_Indicator,
as variable names in the unstacked dataset array.
Load sample data.
Navigate to the folder containing sample data. Import the data
from the spreadsheet
messy.xlsx.
cd(matlabroot)
cd('help/toolbox/stats/examples')
messyData = dataset('XLSFile','messy.xlsx')
messyData =
var1 var2 var3 var4 var5
'afe1' '3' 'yes' '3' 3
'egh3' '.' 'no' '7' 7
'wth4' '3' 'yes' '3' 3
'atn2' '23' 'no' '23' 23
'arg1' '5' 'yes' '5' 5
'jre3' '34.6' 'yes' '34.6' 34.6
'wen9' '234' 'yes' '234' 234
'ple2' '2' 'no' '2' 2
'dbo8' '5' 'no' '5' 5
'oii4' '5' 'yes' '5' 5
'wnk3' '245' 'yes' '245' 245
'abk6' '563' '' '563' 563
'pnj5' '463' 'no' '463' 463
'wnn3' '6' 'no' '6' 6
'oks9' '23' 'yes' '23' 23
'wba3' '' 'yes' 'NaN' 14
'pkn4' '2' 'no' '2' 2
'adw3' '22' 'no' '22' 22
'poj2' '-99' 'yes' '-99' -99
'bas8' '23' 'no' '23' 23
'gry5' 'NA' 'yes' 'NaN' 21
When you import data from a spreadsheet,
dataset reads
any variables with nonnumeric elements as a cell array of strings.
This is why the variable
var2 is a cell array of
strings. When importing data from a text file, you have more flexibility
to specify which nonnumeric expressions to treat as missing using
the option
TreatAsEmpty.
There are many different missing data indicators in
messy.xlsx,
such as:
- Empty cells
- A period (.)
- NA
- NaN
- -99
Find observations with missing values.
Display the subset of observations that have at least one missing
value using
ismissing.
ix = ismissing(messyData,'NumericTreatAsMissing',-99,...
'StringTreatAsMissing',{'NaN','.','NA'});
messyData(any(ix,2),:)
ans =
var1 var2 var3 var4 var5
'egh3' '.' 'no' '7' 7
'abk6' '563' '' '563' 563
'wba3' '' 'yes' 'NaN' 14
'poj2' '-99' 'yes' '-99' -99
'gry5' 'NA' 'yes' 'NaN' 21
By default,
ismissing recognizes the following
missing value indicators:
- NaN for numeric arrays
- '' for string arrays
- <undefined> for categorical
arrays
Use the
NumericTreatAsMissing and
StringTreatAsMissing options
to specify other values to treat as missing.
Convert string arrays to double arrays.
You can convert the string variables that should be numeric
using
str2double.
messyData.var2 = str2double(messyData.var2);
messyData.var4 = str2double(messyData.var4)
messyData =
var1 var2 var3 var4 var5
'afe1' 3 'yes' 3 3
'egh3' NaN 'no' 7 7
'wth4' 3 'yes' 3 3
'atn2' 23 'no' 23 23
'arg1' 5 'yes' 5 5
'jre3' 34.6 'yes' 34.6 34.6
'wen9' 234 'yes' 234 234
'ple2' 2 'no' 2 2
'dbo8' 5 'no' 5 5
'oii4' 5 'yes' 5 5
'wnk3' 245 'yes' 245 245
'abk6' 563 '' 563 563
'pnj5' 463 'no' 463 463
'wnn3' 6 'no' 6 6
'oks9' 23 'yes' 23 23
'wba3' NaN 'yes' NaN 14
'pkn4' 2 'no' 2 2
'adw3' 22 'no' 22 22
'poj2' -99 'yes' -99 -99
'bas8' 23 'no' 23 23
'gry5' NaN 'yes' NaN 21
Now,
var2 and
var4 are
numeric arrays. During the conversion,
str2double replaces
the nonnumeric elements of the variables
var2 and
var4 with
the value
NaN. However, there are no changes to
the numeric missing value indicator,
-99.
When applying the same function to many dataset array variables,
it can sometimes be more convenient to use
datasetfun.
For example, to convert both
var2 and
var4 to
numeric arrays simultaneously, you can use:
messyData(:,[2,4]) = datasetfun(@str2double,messyData,'DataVars',[2,4],...
'DatasetOutput',true);
Replace missing value indicators.
Clean the data so that the missing values indicated by the code
-99 have
the standard MATLAB
® numeric missing value indicator,
NaN.
messyData = replaceWithMissing(messyData,'NumericValues',-99)
messyData =
var1 var2 var3 var4 var5
'afe1' 3 'yes' 3 3
'egh3' NaN 'no' 7 7
'wth4' 3 'yes' 3 3
'atn2' 23 'no' 23 23
'arg1' 5 'yes' 5 5
'jre3' 34.6 'yes' 34.6 34.6
'wen9' 234 'yes' 234 234
'ple2' 2 'no' 2 2
'dbo8' 5 'no' 5 5
'oii4' 5 'yes' 5 5
'wnk3' 245 'yes' 245 245
'abk6' 563 '' 563 563
'pnj5' 463 'no' 463 463
'wnn3' 6 'no' 6 6
'oks9' 23 'yes' 23 23
'wba3' NaN 'yes' NaN 14
'pkn4' 2 'no' 2 2
'adw3' 22 'no' 22 22
'poj2' NaN 'yes' NaN NaN
'bas8' 23 'no' 23 23
'gry5' NaN 'yes' NaN 21
Create a dataset array with complete observations.
Create a new dataset array that contains only the complete observations—those
without missing data.
ix = ismissing(messyData);
completeData = messyData(~any(ix,2),:)
completeData =
var1 var2 var3 var4 var5
'afe1' 3 'yes' 3 3
'wth4' 3 'yes' 3 3
'atn2' 23 'no' 23 23
'arg1' 5 'yes' 5 5
'jre3' 34.6 'yes' 34.6 34.6
'wen9' 234 'yes' 234 234
'ple2' 2 'no' 2 2
'dbo8' 5 'no' 5 5
'oii4' 5 'yes' 5 5
'wnk3' 245 'yes' 245 245
'pnj5' 463 'no' 463 463
'wnn3' 6 'no' 6 6
'oks9' 23 'yes' 23 23
'pkn4' 2 'no' 2 2
'adw3' 22 'no' 22 22
'bas8' 23 'no' 23 23
Load sample data.
Navigate to the folder containing sample data. Import the data
from the comma-separated text file
testScores.csv.
cd(matlabroot)
cd(
'help/toolbox/stats/examples')
ds = dataset(
'File',
'testScores.csv',
'Delimiter',
',')ds =
LastName Sex Test1 Test2 Test3 Test4
'HOWARD' 'male' 90 87 93 92
'WARD' 'male' 87 85 83 90
'TORRES' 'male' 86 85 88 86
'PETERSON' 'female' 75 80 72 77
'GRAY' 'female' 89 86 87 90
'RAMIREZ' 'female' 96 92 98 95
'JAMES' 'male' 78 75 77 77
'WATSON' 'female' 91 94 92 90
'BROOKS' 'female' 86 83 85 89
'KELLY' 'male' 79 76 82 80 There are 4 test scores for each of 10 students, in wide format.
Average dataset array variables.
Compute the average (mean) test score for each student in the
dataset array, and store it in a new variable,
TestAvg.
Test scores are in columns 3 to 6.
Use
double to convert the specified dataset
array variables into a numeric array. Then, calculate the mean across
the second dimension (across columns) to get the test average for
each student.
ds.TestAvg = mean(double(ds(:,3:6)),2);
ds(:,{
'LastName',
'TestAvg'})ans =
LastName TestAvg
'HOWARD' 90.5
'WARD' 86.25
'TORRES' 86.25
'PETERSON' 76
'GRAY' 88
'RAMIREZ' 95.25
'JAMES' 76.75
'WATSON' 91.75
'BROOKS' 85.75
'KELLY' 79.25
Summarize the dataset array using a grouping variable.
Compute the mean and maximum average test scores for each gender.
stats = grpstats(ds,
'Sex',{
'mean',
'max'},
'DataVars',
'TestAvg')stats =
Sex GroupCount mean_TestAvg max_TestAvg
male 'male' 5 83.8 90.5
female 'female' 5 87.35 95.25 This returns a new dataset array containing the specified summary
statistics for each level of the grouping variable,
Sex.
Replace data values.
The denominator for each test score is 100. Convert the test
score denominator to 25.
scores = double(ds(:,3:6));
newScores = scores*25/100;
ds = replacedata(ds,newScores,3:6)ds =
LastName Sex Test1 Test2 Test3 Test4 TestAvg
'HOWARD' 'male' 22.5 21.75 23.25 23 90.5
'WARD' 'male' 21.75 21.25 20.75 22.5 86.25
'TORRES' 'male' 21.5 21.25 22 21.5 86.25
'PETERSON' 'female' 18.75 20 18 19.25 76
'GRAY' 'female' 22.25 21.5 21.75 22.5 88
'RAMIREZ' 'female' 24 23 24.5 23.75 95.25
'JAMES' 'male' 19.5 18.75 19.25 19.25 76.75
'WATSON' 'female' 22.75 23.5 23 22.5 91.75
'BROOKS' 'female' 21.5 20.75 21.25 22.25 85.75
'KELLY' 'male' 19.75 19 20.5 20 79.25 The first two lines of code extract the test data and perform
the desired calculation. Then,
replacedata inserts
the new test scores back into the dataset array.
The variable of test score averages,
TestAvg,
is now the final score for each student.
Change variable name.
Change the variable name to
Final.
ds.Properties.VarNames{end} =
'Final';
dsds =
LastName Sex Test1 Test2 Test3 Test4 Final
'HOWARD' 'male' 22.5 21.75 23.25 23 90.5
'WARD' 'male' 21.75 21.25 20.75 22.5 86.25
'TORRES' 'male' 21.5 21.25 22 21.5 86.25
'PETERSON' 'female' 18.75 20 18 19.25 76
'GRAY' 'female' 22.25 21.5 21.75 22.5 88
'RAMIREZ' 'female' 24 23 24.5 23.75 95.25
'JAMES' 'male' 19.5 18.75 19.25 19.25 76.75
'WATSON' 'female' 22.75 23.5 23 22.5 91.75
'BROOKS' 'female' 21.5 20.75 21.25 22.25 85.75
'KELLY' 'male' 19.75 19 20.5 20 79.25