Create a Dataset Array from a Numeric Array
This example shows how to create a dataset array from a numeric array existing in the MATLAB workspace.
Load sample data.
load('fisheriris')
Two variables load into the workspace: meas,
a 150-by-4 numeric array, and species, a 150-by-1
cell array of strings containing species labels.
Create a dataset array.
Use mat2dataset to convert the numeric array, meas,
into a dataset array.ds = mat2dataset(meas); ds(1:10,:)
ans = meas1 meas2 meas3 meas4 5.1 3.5 1.4 0.2 4.9 3 1.4 0.2 4.7 3.2 1.3 0.2 4.6 3.1 1.5 0.2 5 3.6 1.4 0.2 5.4 3.9 1.7 0.4 4.6 3.4 1.4 0.3 5 3.4 1.5 0.2 4.4 2.9 1.4 0.2 4.9 3.1 1.5 0.1The array, meas, has four columns, so the dataset array, ds, has four variables. The default variable names are the array name, meas, with column numbers appended.
You can specify your own variable or observation names using the name-value pair arguments VarNames and ObsNames, respectively.
If you use dataset to convert a numeric array to a dataset array, by default, the resulting dataset array has one variable that is an array instead of separate variables for each column.
Examine the dataset array.
Return the size of the dataset array, ds.size(ds)
ans = 150 4The dataset array, ds, is the same size as the numeric array, meas. Variable names and observation names do not factor into the size of a dataset array.
Explore dataset array metadata.
Return the metadata properties of the dataset array, ds.ds.Properties
ans = Description: '' VarDescription: {} Units: {} DimNames: {'Observations' 'Variables'} UserData: [] ObsNames: {} VarNames: {'meas1' 'meas2' 'meas3' 'meas4'}You can also access the properties individually. For example, you can retrieve the variable names using ds.Properties.VarNames.
Access data in a dataset array variable.
You can use variable names with dot indexing to access the data
in a dataset array. For example, find the minimum value in the first
variable, meas1.min(ds.meas1)
ans = 4.3000
Change variable names.
The four variables in ds are actually measurements
of sepal length, sepal width, petal length, and petal width. Modify
the variable names to be more descriptive.ds.Properties.VarNames = {'SLength','SWidth','PLength','PWidth'};
Add description.
you can add a description for the dataset array.ds.Properties.Description = 'Fisher iris data';
ds.Properties
ans = Description: 'Fisher iris data' VarDescription: {} Units: {} DimNames: {'Observations' 'Variables'} UserData: [] ObsNames: {} VarNames: {'SLength' 'SWidth' 'PLength' 'PWidth'}The dataset array properties are updated with the new variable names and description.
Add a variable to the dataset array.
The variable species is a cell array of strings
containing species labels. Add species to the dataset
array, ds, as a nominal array named Species.
Display the first five observations in the dataset array.ds.Species = nominal(species); ds(1:5,:)
ans = SLength SWidth PLength PWidth Species 5.1 3.5 1.4 0.2 setosa 4.9 3 1.4 0.2 setosa 4.7 3.2 1.3 0.2 setosa 4.6 3.1 1.5 0.2 setosa 5 3.6 1.4 0.2 setosaThe dataset array, ds, now has the fifth variable, Species.
Create Dataset Array from Heterogeneous Workspace Variables
This example shows how to create a dataset array from heterogeneous variables existing in the MATLAB workspace.
Load sample data.
load('carsmall')
Create a dataset array.
Create a dataset array from a subset of the workspace variables.ds = dataset(Origin,Acceleration,Cylinders,MPG); ds.Properties.VarNames(:)
ans = 'Origin' 'Acceleration' 'Cylinders' 'MPG'When creating the dataset array, you do not need to enter variable names. dataset automatically uses the name of each workspace variable.
Notice that the dataset array, ds, contains a collection of variables with heterogeneous data types. Origin is a character array, and the other variables are numeric.
Examine a dataset array.
Display the first five observations in the dataset array.ds(1:5,:)
ans = Origin Acceleration Cylinders MPG USA 12 8 18 USA 11.5 8 15 USA 11 8 18 USA 12 8 16 USA 10.5 8 17
Apply a function to a dataset array.
Use datasetfun to return the data type of
each variable in ds.varclass = datasetfun(@class,ds,'UniformOutput',false);
varclass(:)
ans = 'char' 'double' 'double' 'double'You can get additional information about the variables using summary(ds).
Modify a dataset array.
Cylinders is a numeric variable that has
values 4, 6, and 8 for
the number of cylinders. Convert Cylinders to a
nominal array with levels four, six,
and eight.Display the country of origin and number of cylinders for the first 15 cars.
ds.Cylinders = nominal(ds.Cylinders,{'four','six','eight'}); ds(1:15,{'Origin','Cylinders'})
ans = Origin Cylinders USA eight USA eight USA eight USA eight USA eight USA eight USA eight USA eight USA eight USA eight France four USA eight USA eight USA eight USA eightThe variable Cylinders has a new data type.
Create a Dataset Array from a Tab-Delimited Text File
This example shows how to create a dataset array from the contents of a tab-delimited text file.
Create a dataset array using default settings.
Navigate to the folder containing sample data. Import the text
file hospitalSmall.txt as a dataset array using
the default settings.cd(matlabroot) cd('help/toolbox/stats/examples') ds = dataset('File','hospitalSmall.txt')
ds = name sex age wgt smoke 'SMITH' 'm' 38 176 1 'JOHNSON' 'm' 43 163 0 'WILLIAMS' 'f' 38 131 0 'JONES' 'f' 40 133 0 'BROWN' 'f' 49 119 0 'DAVIS' 'f' 46 142 0 'MILLER' 'f' 33 142 1 'WILSON' 'm' 40 180 0 'MOORE' 'm' 28 183 0 'TAYLOR' 'f' 31 132 0 'ANDERSON' 'f' 45 128 0 'THOMAS' 'f' 42 137 0 'JACKSON' 'm' 25 174 0 'WHITE' 'm' 39 202 1By default, dataset uses the first row of the text file for variable names. If the first row does not contain variable names, you can specify the optional name-value pair argument 'ReadVarNames',false to change the default behavior.
The dataset array contains heterogeneous variables. The variables id, name, and sex are cell arrays of strings, and the other variables are numeric.
Summarize the dataset array.
You can see the data type and other descriptive statistics for
each variable by using summary to summarize the
dataset array.summary(ds)
name: [14x1 cell string] sex: [14x1 cell string] age: [14x1 double] min 1st quartile median 3rd quartile max 25 33 39.5 43 49 wgt: [14x1 double] min 1st quartile median 3rd quartile max 119 132 142 176 202 smoke: [14x1 double] min 1st quartile median 3rd quartile max 0 0 0 0 1
Import observation names.
Import the text file again, this time specifying that the first
column contains observation names.ds = dataset('File','hospitalSmall.txt','ReadObsNames',true)
ds = sex age wgt smoke SMITH 'm' 38 176 1 JOHNSON 'm' 43 163 0 WILLIAMS 'f' 38 131 0 JONES 'f' 40 133 0 BROWN 'f' 49 119 0 DAVIS 'f' 46 142 0 MILLER 'f' 33 142 1 WILSON 'm' 40 180 0 MOORE 'm' 28 183 0 TAYLOR 'f' 31 132 0 ANDERSON 'f' 45 128 0 THOMAS 'f' 42 137 0 JACKSON 'm' 25 174 0 WHITE 'm' 39 202 1The elements of the first column in the text file, last names, are now observation names. Observation names and row names are dataset array properties. You can always add or change the observation names of an existing dataset array by modifying the property ObsNames.
Change dataset array properties.
By default, the DimNames property of the
dataset array has name as the descriptor of the
observation (row) dimension. dataset got this name
from the first row of the first column in the text file. Change the first element of DimNames to LastName.
ds.Properties.DimNames{1} = 'LastName';
ds.Properties
ans = Description: '' VarDescription: {} Units: {} DimNames: {'LastName' 'Variables'} UserData: [] ObsNames: {14x1 cell} VarNames: {'sex' 'age' 'wgt' 'smoke'}
Index into dataset array.
You can use observation names to index into a dataset array.
For example, return the data for the patient with last name BROWN.ds('BROWN',:)
ans = sex age wgt smoke BROWN 'f' 49 119 0Note that observation names must be unique.
Create a Dataset Array from a Comma-Separated Text File
This example shows how to create a dataset array from the contents of a comma-separated text file.
Create a dataset array.
Navigate to the folder containing sample data. Import the file hospitalSmall.csv as
a dataset array, specifying the comma-delimited format.cd(matlabroot) cd('help/toolbox/stats/examples') ds = dataset('File','hospitalSmall.csv','Delimiter',',')
ds = id name sex age wgt smoke 'YPL-320' 'SMITH' 'm' 38 176 1 'GLI-532' 'JOHNSON' 'm' 43 163 0 'PNI-258' 'WILLIAMS' 'f' 38 131 0 'MIJ-579' 'JONES' 'f' 40 133 0 'XLK-030' 'BROWN' 'f' 49 119 0 'TFP-518' 'DAVIS' 'f' 46 142 0 'LPD-746' 'MILLER' 'f' 33 142 1 'ATA-945' 'WILSON' 'm' 40 180 0 'VNL-702' 'MOORE' 'm' 28 183 0 'LQW-768' 'TAYLOR' 'f' 31 132 0 'QFY-472' 'ANDERSON' 'f' 45 128 0 'UJG-627' 'THOMAS' 'f' 42 137 0 'XUE-826' 'JACKSON' 'm' 25 174 0 'TRW-072' 'WHITE' 'm' 39 202 1By default, dataset uses the first row in the text file as variable names.
Add observation names.
Use the unique identifiers in the variable id as
observation names. Then, delete the variable id from
the dataset array.ds.Properties.ObsNames = ds.id; ds.id = []
ds = name sex age wgt smoke YPL-320 'SMITH' 'm' 38 176 1 GLI-532 'JOHNSON' 'm' 43 163 0 PNI-258 'WILLIAMS' 'f' 38 131 0 MIJ-579 'JONES' 'f' 40 133 0 XLK-030 'BROWN' 'f' 49 119 0 TFP-518 'DAVIS' 'f' 46 142 0 LPD-746 'MILLER' 'f' 33 142 1 ATA-945 'WILSON' 'm' 40 180 0 VNL-702 'MOORE' 'm' 28 183 0 LQW-768 'TAYLOR' 'f' 31 132 0 QFY-472 'ANDERSON' 'f' 45 128 0 UJG-627 'THOMAS' 'f' 42 137 0 XUE-826 'JACKSON' 'm' 25 174 0 TRW-072 'WHITE' 'm' 39 202 1
Delete observations.
Delete any patients with the last name BROWN.
You can use strcmp to match the string 'BROWN' with
the elements of the variable containing last names, name.toDelete = strcmp(ds.name,'BROWN');
ds(toDelete,:) = []
ds = name sex age wgt smoke YPL-320 'SMITH' 'm' 38 176 1 GLI-532 'JOHNSON' 'm' 43 163 0 PNI-258 'WILLIAMS' 'f' 38 131 0 MIJ-579 'JONES' 'f' 40 133 0 TFP-518 'DAVIS' 'f' 46 142 0 LPD-746 'MILLER' 'f' 33 142 1 ATA-945 'WILSON' 'm' 40 180 0 VNL-702 'MOORE' 'm' 28 183 0 LQW-768 'TAYLOR' 'f' 31 132 0 QFY-472 'ANDERSON' 'f' 45 128 0 UJG-627 'THOMAS' 'f' 42 137 0 XUE-826 'JACKSON' 'm' 25 174 0 TRW-072 'WHITE' 'm' 39 202 1One patient having last name BROWN is deleted from the dataset array.
Return size of dataset array.
The array now has 13 observations.size(ds)
ans = 13 5Note that the row and column corresponding to variable and observation names, respectively, are not included in the size of a dataset array.
Create a Dataset Array from an Excel File
This example shows how to create a dataset array from the contents of an Excel® spreadsheet file.
Create a dataset array.
Navigate to the folder containing sample data. Import the data
from the first worksheet in the file hospitalSmall.xlsx,
specifying that the data file is an Excel spreadsheet.cd(matlabroot) cd('help/toolbox/stats/examples') ds = dataset('XLSFile','hospitalSmall.xlsx')
ds = id name sex age wgt smoke 'YPL-320' 'SMITH' 'm' 38 176 1 'GLI-532' 'JOHNSON' 'm' 43 163 0 'PNI-258' 'WILLIAMS' 'f' 38 131 0 'MIJ-579' 'JONES' 'f' 40 133 0 'XLK-030' 'BROWN' 'f' 49 119 0 'TFP-518' 'DAVIS' 'f' 46 142 0 'LPD-746' 'MILLER' 'f' 33 142 1 'ATA-945' 'WILSON' 'm' 40 180 0 'VNL-702' 'MOORE' 'm' 28 183 0 'LQW-768' 'TAYLOR' 'f' 31 132 0 'QFY-472' 'ANDERSON' 'f' 45 128 0 'UJG-627' 'THOMAS' 'f' 42 137 0 'XUE-826' 'JACKSON' 'm' 25 174 0 'TRW-072' 'WHITE' 'm' 39 202 1By default, dataset creates variable names using the contents of the first row in the spreadsheet.
Specify which worksheet to import.
Import the data from the second worksheet into a new dataset
array.ds2 = dataset('XLSFile','hospitalSmall.xlsx','Sheet',2)
ds2 = id name sex age wgt smoke 'TRW-072' 'WHITE' 'm' 39 202 1 'ELG-976' 'HARRIS' 'f' 36 129 0 'KOQ-996' 'MARTIN' 'm' 48 181 1 'YUZ-646' 'THOMPSON' 'm' 32 191 1 'XBR-291' 'GARCIA' 'f' 27 131 1 'KPW-846' 'MARTINEZ' 'm' 37 179 0 'XBA-581' 'ROBINSON' 'm' 50 172 0 'BKD-785' 'CLARK' 'f' 48 133 0
Load sample data.
Navigate to the folder containing sample data. Import the data
from the first worksheet in hospitalSmall.xlsx into
a dataset array.cd(matlabroot) cd('help/toolbox/stats/examples') ds = dataset('XLSFile','hospitalSmall.xlsx'); size(ds)ans = 14 6The dataset array, ds, has 14 observations (rows) and 6 variables (columns).
Add observations by concatenation.
The second worksheet in hospitalSmall.xlsx has
additional patient data. Append the observations in this spreadsheet
to the end of ds.ds2 = dataset('XLSFile','hospitalSmall.xlsx','Sheet',2); dsNew = [ds;ds2]; size(dsNew)ans = 22 6The dataset array dsNew has 22 observations. In order to vertically concatenate two dataset arrays, both arrays must have the same number of variables, with the same variable names.
Add observations from a cell array.
If you want to append new observations stored in a cell array,
first convert the cell array to a dataset array, and then concatenate
the dataset arrays.cellObs = {'id','name','sex','age','wgt','smoke'; 'YQR-965','BAKER','M',36,160,0; 'LFG-497','WALL' ,'F',28,125,1; 'KSD-003','REED' ,'M',32,187,0}; dsNew = [dsNew;cell2dataset(cellObs)]; size(dsNew)ans = 25 6
Add observations from a structure.
You can also append new observations stored in a structure.
Convert the structure to a dataset array, and then concatenate the
dataset arrays.structObs(1,1).id = 'GHK-842'; structObs(1,1).name = 'GEORGE'; structObs(1,1).sex = 'M'; structObs(1,1).age = 45; structObs(1,1).wgt = 182; structObs(1,1).smoke = 1; structObs(2,1).id = 'QRH-308'; structObs(2,1).name = 'BAILEY'; structObs(2,1).sex = 'F'; structObs(2,1).age = 29; structObs(2,1).wgt = 120; structObs(2,1).smoke = 0; dsNew = [dsNew;struct2dataset(structObs)]; size(dsNew)ans = 27 6
Delete duplicate observations.
Use unique to delete any observations in
a dataset array that are duplicated.dsNew = unique(dsNew); size(dsNew)ans = 21 6 One duplicated observation is deleted.
Delete observations by observation number.
Delete observations 18, 20, and 21 from the dataset array.dsNew([18,20,21],:) = []; size(dsNew)ans = 18 6The dataset array has only 18 observations now.
Delete observations by observation name.
First, specify the variable of identifiers, id,
as observation names. Then, delete the variable id from dsNew.
You can use the observation name to index observations.dsNew.Properties.ObsNames = dsNew.id; dsNew.id = []; dsNew('KOQ-996',:) = []; size(dsNew)ans = 17 5The dataset array now has one less observation and one less variable.
Search for observations to delete.
You can also search for observations in the dataset array. For
example, delete observations for any patients with the last name WILLIAMS.toDelete = strcmp(dsNew.name,'WILLIAMS'); dsNew(toDelete,:) = []; size(dsNew)ans = 16 5The dataset array now has one less observation.
Load sample data.
Navigate to the folder containing sample data.cd(matlabroot)
cd('help/toolbox/stats/examples')
Import the data from the first worksheet in hospitalSmall.xlsx into
a dataset array.ds = dataset('XLSFile','hospitalSmall.xlsx'); size(ds)
ans = 14 6The dataset array, ds, has 14 observations (rows) and 6 variables (columns).
Add variables by concatenating dataset arrays.
The worksheet Heights in hospitalSmall.xlsx has
heights for the patients on the first worksheet. Concatenate the data
in this spreadsheet with ds.ds2 = dataset('XLSFile','hospitalSmall.xlsx','Sheet','Heights'); ds = [ds ds2]; size(ds)
ans = 14 7The dataset array now has seven variables. You can only horizontally concatenate dataset arrays with observations in the same position, or with the same observation names.
ds.Properties.VarNames{end}
ans = hgtThe name of the last variable in ds is hgt, which dataset read from the first row of the imported spreadsheet.
Delete variables by variable name.
First, specify the unique identifiers in the variable id as
observation names. Then, delete the variable id from
the dataset array.ds.Properties.ObsNames = ds.id; ds.id = []; size(ds)
ans = 14 6The dataset array now has six variables. List the variable names.
ds.Properties.VarNames(:)
ans = 'name' 'sex' 'age' 'wgt' 'smoke' 'hgt'There is no longer a variable called id.
Add a new variable by name.
Add a new variable, bmi—which contains
the body mass index (BMI) for each patient—to the dataset array.
BMI is a function of height and weight. Display the last name, gender,
and BMI for each patient.ds.bmi = ds.wgt*703./ds.hgt.^2; ds(:,{'name','sex','bmi'})
ans = name sex bmi YPL-320 'SMITH' 'm' 24.544 GLI-532 'JOHNSON' 'm' 24.068 PNI-258 'WILLIAMS' 'f' 23.958 MIJ-579 'JONES' 'f' 25.127 XLK-030 'BROWN' 'f' 21.078 TFP-518 'DAVIS' 'f' 27.729 LPD-746 'MILLER' 'f' 26.828 ATA-945 'WILSON' 'm' 24.41 VNL-702 'MOORE' 'm' 27.822 LQW-768 'TAYLOR' 'f' 22.655 QFY-472 'ANDERSON' 'f' 23.409 UJG-627 'THOMAS' 'f' 25.883 XUE-826 'JACKSON' 'm' 24.265 TRW-072 'WHITE' 'm' 29.827The operators ./ and .^ in the calculation of BMI indicate element-wise division and exponentiation, respectively.
Delete variables by variable number.
Delete the variable wgt, the fourth variable
in the dataset array.ds(:,4) = []; ds.Properties.VarNames(:)
ans = 'name' 'sex' 'age' 'smoke' 'hgt' 'bmi'The variable wgt is deleted from the dataset array.
Access variables by name.
You can access variable data, or select a subset of variables,
by using variable (column) names and dot indexing. Load a sample dataset
array. Display the names of the variables in hospital.load('hospital')
hospital.Properties.VarNames(:)
ans = 'LastName' 'Sex' 'Age' 'Weight' 'Smoker' 'BloodPressure' 'Trials'The dataset array has 7 variables (columns) and 100 observations (rows). You can double-click hospital in the Workspace window to view the dataset array in the Variables editor.
Plot histogram.
Plot a histogram of the data in the variable Weight.figure() hist(hospital.Weight)
The histogram shows that the weight distribution is bimodal.
Plot data grouped by category.
Draw box plots of Weight grouped by the values
in Sex (Male and Female). That is, use the variable Sex as
a grouping variable.figure() boxplot(hospital.Weight,hospital.Sex)
The box plot suggests that gender accounts for the bimodality in weight.
Select a subset of variables.
Create a new dataset array with only the variables LastName, Sex,
and Weight. You can access the variables by name
or column number.ds1 = hospital(:,{'LastName','Sex','Weight'}); ds2 = hospital(:,[1,2,4]);The dataset arrays ds1 and ds2 are equivalent. Use parentheses ( ) when indexing dataset arrays to preserve the data type; that is, to create a dataset array from a subset of a dataset array. You can also use the Variables editor to create a new dataset array from a subset of variables and observations.
Convert the variable data type.
Convert the data type of the variable Smoker from
logical to nominal with labels No and Yes.hospital.Smoker = nominal(hospital.Smoker,{'No','Yes'}); class(hospital.Smoker)
ans = nominal
Explore data.
Display the first 10 elements of Smoker.hospital.Smoker(1:10)
ans = Yes No No No No No Yes No No NoIf you want to change the level labels in a nominal array, use setlabels.
Add variables.
The variable BloodPressure is a 100-by-2
array. The first column corresponds to systolic blood pressure, and
the second column to diastolic blood pressure. Separate this array
into two new variables, SysPressure and DiaPressure.hospital.SysPressure = hospital.BloodPressure(:,1); hospital.DiaPressure = hospital.BloodPressure(:,2); hospital.Properties.VarNames(:)
ans = 'LastName' 'Sex' 'Age' 'Weight' 'Smoker' 'BloodPressure' 'Trials' 'SysPressure' 'DiaPressure'The dataset array, hospital, has two new variables.
Search for variables by name.
Use regexp to find variables in hospital with
the string 'Pressure' in their name. Create a new
dataset array containing only these variables.bp = regexp(hospital.Properties.VarNames,'Pressure');
bpIdx = cellfun(@isempty,bp);
bpData = hospital(:,~bpIdx);
bpData.Properties.VarNames(:)
ans = 'BloodPressure' 'SysPressure' 'DiaPressure'The new dataset array, bpData, contains only the blood pressure variables.
Delete variables.
Delete the variable BloodPressure from the
dataset array, hospital.hospital.BloodPressure = []; hospital.Properties.VarNames(:)
ans = 'LastName' 'Sex' 'Age' 'Weight' 'Smoker' 'Trials' 'SysPressure' 'DiaPressure'The variable BloodPressure is no longer in the dataset array.
Load sample data.
Load the sample dataset array, hospital.
Dataset arrays can have observation (row) names. This array has observation
names corresponding to unique patient identifiers.load('hospital')
hospital.Properties.ObsNames(1:10)
ans = 'YPL-320' 'GLI-532' 'PNI-258' 'MIJ-579' 'XLK-030' 'TFP-518' 'LPD-746' 'ATA-945' 'VNL-702' 'LQW-768'These are the first 10 observation names.
Index an observation by name.
You can use the observation names to index into the dataset
array. For example, extract the last name, sex, and age for the patient
with identifier XLK-030.hospital('XLK-030',{'LastName','Sex','Age'})
ans = LastName Sex Age XLK-030 'BROWN' Female 49
Index a subset of observations by number.
Create a new dataset array containing the first 50 patients.ds50 = hospital(1:50,:); size(ds50)
ans = 50 7
Search observations using a logical condition.
Create a new dataset array containing only male patients. To
find the male patients, use a logical condition to search the variable
containing gender information.dsMale = hospital(hospital.Sex=='Male',:); dsMale(1:10,{'LastName','Sex'})
ans = LastName Sex YPL-320 'SMITH' Male GLI-532 'JOHNSON' Male ATA-945 'WILSON' Male VNL-702 'MOORE' Male XUE-826 'JACKSON' Male TRW-072 'WHITE' Male KOQ-996 'MARTIN' Male YUZ-646 'THOMPSON' Male KPW-846 'MARTINEZ' Male XBA-581 'ROBINSON' Male
Search observations using multiple conditions.
You can use multiple conditions to search the dataset array.
For example, create a new dataset array containing only female patients
older than 40.dsFemale = hospital(hospital.Sex=='Female' & hospital.Age > 40,:); dsFemale(1:10,{'LastName','Sex','Age'})
ans = LastName Sex Age XLK-030 'BROWN' Female 49 TFP-518 'DAVIS' Female 46 QFY-472 'ANDERSON' Female 45 UJG-627 'THOMAS' Female 42 BKD-785 'CLARK' Female 48 VWL-936 'LEWIS' Female 41 AAX-056 'LEE' Female 44 AFK-336 'WRIGHT' Female 45 KKL-155 'ADAMS' Female 48 RBA-579 'SANCHEZ' Female 44
Select a random subset of observations.
Create a new dataset array containing a random subset of 20
patients from the dataset array hospital.rng('default') % For reproducibility dsRandom = hospital(randsample(length(hospital),20),:); dsRandom.Properties.ObsNames
ans = 'DAU-529' 'AGR-528' 'RBO-332' 'QOO-305' 'RVS-253' 'QEQ-082' 'EHE-616' 'HVR-372' 'KOQ-996' 'REV-997' 'PUE-347' 'LQW-768' 'YLN-495' 'HJQ-495' 'ELG-976' 'XUE-826' 'MEZ-469' 'UDS-151' 'MIJ-579' 'DGC-290'
Delete observations by name.
Delete the data for the patient with observation name HVR-372.hospital('HVR-372',:) = [];
size(hospital)
ans = 99 7The dataset array has one less observation.
Sort observations in ascending order.
Load the sample dataset array, hospital.
Sort the observations by the values in Age, in
ascending order.load('hospital') dsAgeUp = sortrows(hospital,'Age'); dsAgeUp(1:10,{'LastName','Age'})
ans = LastName Age XUE-826 'JACKSON' 25 FZR-250 'HALL' 25 PUE-347 'YOUNG' 25 LIM-480 'HILL' 25 SCQ-914 'JAMES' 25 REV-997 'ALEXANDER' 25 XBR-291 'GARCIA' 27 VNL-702 'MOORE' 28 DTT-578 'WALKER' 28 XAX-646 'COOPER' 28The youngest patients are age 25.
Sort observations in descending order.
Sort the observations by Age in descending
order.dsAgeDown = sortrows(hospital,'Age','descend'); dsAgeDown(1:10,{'LastName','Age'})
ans = LastName Age XBA-581 'ROBINSON' 50 DAU-529 'REED' 50 XLK-030 'BROWN' 49 FLJ-908 'STEWART' 49 GGU-691 'HUGHES' 49 MEZ-469 'GRIFFIN' 49 KOQ-996 'MARTIN' 48 BKD-785 'CLARK' 48 KKL-155 'ADAMS' 48 NSK-403 'RAMIREZ' 48The oldest patients are age 50.
Sort observations by the values of two variables.
Sort the observations in hospital by Age,
and then by LastName.dsName = sortrows(hospital,{'Age','LastName'}); dsName(1:10,{'LastName','Age'})
ans = LastName Age REV-997 'ALEXANDER' 25 FZR-250 'HALL' 25 LIM-480 'HILL' 25 XUE-826 'JACKSON' 25 SCQ-914 'JAMES' 25 PUE-347 'YOUNG' 25 XBR-291 'GARCIA' 27 XAX-646 'COOPER' 28 QEQ-082 'COX' 28 NSU-424 'JENKINS' 28Now the names are sorted alphabetically within increasing age groups.
Sort observations in mixed order.
Sort the observations in hospital by Age in
an increasing order, and then by Weight in a decreasing
order.dsWeight = sortrows(hospital,{'Age','Weight'},{'ascend','descend'}); dsWeight(1:10,{'LastName','Age','Weight'})
ans = LastName Age Weight FZR-250 'HALL' 25 189 SCQ-914 'JAMES' 25 186 XUE-826 'JACKSON' 25 174 REV-997 'ALEXANDER' 25 171 LIM-480 'HILL' 25 138 PUE-347 'YOUNG' 25 114 XBR-291 'GARCIA' 27 131 NSU-424 'JENKINS' 28 189 VNL-702 'MOORE' 28 183 XAX-646 'COOPER' 28 127This shows that the maximum weight among patients that are age 25 is 189 lbs.
Sort observations by observation name.
Sort the observations in hospital by the
observation names.dsObs = sortrows(hospital,'obsnames'); dsObs(1:10,{'LastName','Age'})
ans = LastName Age AAX-056 'LEE' 44 AFB-271 'PEREZ' 44 AFK-336 'WRIGHT' 45 AGR-528 'SIMMONS' 45 ATA-945 'WILSON' 40 BEZ-311 'DIAZ' 45 BKD-785 'CLARK' 48 DAU-529 'REED' 50 DGC-290 'BUTLER' 38 DTT-578 'WALKER' 28The observations are sorted by observation name in ascending alphabetical order.
Load sample data.
Navigate to a folder containing sample data. Import the data
from the first worksheet in hospitalSmall.xlsx into
a dataset array, then keep only a few of the variables.cd(matlabroot) cd('help/toolbox/stats/examples') ds1 = dataset('XLSFile','hospitalSmall.xlsx'); ds1 = ds1(:,{'id','name','sex','age'})
ds1 = id name sex age 'YPL-320' 'SMITH' 'm' 38 'GLI-532' 'JOHNSON' 'm' 43 'PNI-258' 'WILLIAMS' 'f' 38 'MIJ-579' 'JONES' 'f' 40 'XLK-030' 'BROWN' 'f' 49 'TFP-518' 'DAVIS' 'f' 46 'LPD-746' 'MILLER' 'f' 33 'ATA-945' 'WILSON' 'm' 40 'VNL-702' 'MOORE' 'm' 28 'LQW-768' 'TAYLOR' 'f' 31 'QFY-472' 'ANDERSON' 'f' 45 'UJG-627' 'THOMAS' 'f' 42 'XUE-826' 'JACKSON' 'm' 25 'TRW-072' 'WHITE' 'm' 39The dataset array, ds1, has 14 observations (rows) and 4 variables (columns).
Import the data from the worksheet Heights2 in hospitalSmall.xlsx.
ds2 = dataset('XLSFile','hospitalSmall.xlsx','Sheet','Heights2')
ds2 = id hgt 'LPD-746' 61 'PNI-258' 62 'XUE-826' 71 'ATA-945' 72 'XLK-030' 63ds2 has height measurements for a subset of five individuals from the first dataset array, ds1.
Merge only the matching subset of observations.
Use join to merge the two dataset arrays, ds1 and ds2,
keeping only the subset of observations that are in ds2.JoinSmall = join(ds2,ds1)
JoinSmall = id hgt name sex age 'LPD-746' 61 'MILLER' 'f' 33 'PNI-258' 62 'WILLIAMS' 'f' 38 'XUE-826' 71 'JACKSON' 'm' 25 'ATA-945' 72 'WILSON' 'm' 40 'XLK-030' 63 'BROWN' 'f' 49In JoinSmall, the variable id only appears once. This is because it is the key variable—the variable that links observations between the two dataset arrays—and has the same variable name in both ds1 and ds2.
Include incomplete observations in the merge.
Merge ds1 and ds2 keeping
all observations in the larger ds1.joinAll = join(ds2,ds1,'type','rightouter','mergekeys',true)
joinAll = id hgt name sex age 'ATA-945' 72 'WILSON' 'm' 40 'GLI-532' NaN 'JOHNSON' 'm' 43 'LPD-746' 61 'MILLER' 'f' 33 'LQW-768' NaN 'TAYLOR' 'f' 31 'MIJ-579' NaN 'JONES' 'f' 40 'PNI-258' 62 'WILLIAMS' 'f' 38 'QFY-472' NaN 'ANDERSON' 'f' 45 'TFP-518' NaN 'DAVIS' 'f' 46 'TRW-072' NaN 'WHITE' 'm' 39 'UJG-627' NaN 'THOMAS' 'f' 42 'VNL-702' NaN 'MOORE' 'm' 28 'XLK-030' 63 'BROWN' 'f' 49 'XUE-826' 71 'JACKSON' 'm' 25 'YPL-320' NaN 'SMITH' 'm' 38Each observation in ds1 without corresponding height measurements in ds2 has height value NaN. Also, because there is no id value in ds2 for each observation in ds1, you need to merge the keys using the option 'MergeKeys',true. This merges the key variable, id.
Merge dataset arrays with different key variable names.
When using join, it is not necessary for
the key variable to have the same name in the dataset arrays to be
merged. Import the data from the worksheet named Heights3 in hospitalSmall.xlsx.ds3 = dataset('XLSFile','hospitalSmall.xlsx','Sheet','Heights3')
ds3 = identifier hgt 'GLI-532' 69 'QFY-472' 62 'MIJ-579' 61 'VNL-702' 68 'XLK-030' 63 'LPD-746' 61 'TFP-518' 60 'YPL-320' 71 'ATA-945' 72 'LQW-768' 64 'PNI-258' 62 'UJG-627' 61 'XUE-826' 71 'TRW-072' 69ds3 has height measurements for each observation in ds1. This dataset array has the same patient identifiers as ds1, but they are under the variable name identifier, instead of id (and in a different order).
Specify key variable.
You can easily change the variable name of the key variable
in ds3 by setting d3.Properties.VarNames or
using the Variables editor, but it is not required to perform a merge.
Instead, you can specify the name of the key variable in each dataset
array using LeftKeys and RightKeys.joinDiff = join(ds3,ds1,'LeftKeys','identifier','RightKeys','id')
joinDiff = identifier hgt name sex age 'GLI-532' 69 'JOHNSON' 'm' 43 'QFY-472' 62 'ANDERSON' 'f' 45 'MIJ-579' 61 'JONES' 'f' 40 'VNL-702' 68 'MOORE' 'm' 28 'XLK-030' 63 'BROWN' 'f' 49 'LPD-746' 61 'MILLER' 'f' 33 'TFP-518' 60 'DAVIS' 'f' 46 'YPL-320' 71 'SMITH' 'm' 38 'ATA-945' 72 'WILSON' 'm' 40 'LQW-768' 64 'TAYLOR' 'f' 31 'PNI-258' 62 'WILLIAMS' 'f' 38 'UJG-627' 61 'THOMAS' 'f' 42 'XUE-826' 71 'JACKSON' 'm' 25 'TRW-072' 69 'WHITE' 'm' 39The merged dataset array, joinDiff, has the same key variable order and name as the first dataset array input to join, ds3.
Load sample data.
Navigate to the folder containing sample data. Import the data
from the comma-separated text file testScores.csv.cd(matlabroot) cd('help/toolbox/stats/examples') ds = dataset('File','testScores.csv','Delimiter',',')
ds = LastName Sex Test1 Test2 Test3 Test4 'HOWARD' 'male' 90 87 93 92 'WARD' 'male' 87 85 83 90 'TORRES' 'male' 86 85 88 86 'PETERSON' 'female' 75 80 72 77 'GRAY' 'female' 89 86 87 90 'RAMIREZ' 'female' 96 92 98 95 'JAMES' 'male' 78 75 77 77 'WATSON' 'female' 91 94 92 90 'BROOKS' 'female' 86 83 85 89 'KELLY' 'male' 79 76 82 80Each of the 10 students has 4 test scores, displayed here in wide format.
Perform calculations on dataset array.
With the data in this format, you can, for example, calculate
the average test score for each student. The test scores are in columns
3 to 6.ds.TestAve = mean(double(ds(:,3:6)),2); ds(:,{'LastName','Sex','TestAve'})
ans = LastName Sex TestAve 'HOWARD' 'male' 90.5 'WARD' 'male' 86.25 'TORRES' 'male' 86.25 'PETERSON' 'female' 76 'GRAY' 'female' 88 'RAMIREZ' 'female' 95.25 'JAMES' 'male' 76.75 'WATSON' 'female' 91.75 'BROOKS' 'female' 85.75 'KELLY' 'male' 79.25A new variable with average test scores is added to the dataset array, ds.
Reformat the dataset array into tall format.
Stack the test score variables into a new variable, Scores.dsTall = stack(ds,{'Test1','Test2','Test3','Test4'},... 'newDataVarName','Scores')
dsTall = LastName Sex TestAve Scores_Indicator Scores 'HOWARD' 'male' 90.5 Test1 90 'HOWARD' 'male' 90.5 Test2 87 'HOWARD' 'male' 90.5 Test3 93 'HOWARD' 'male' 90.5 Test4 92 'WARD' 'male' 86.25 Test1 87 'WARD' 'male' 86.25 Test2 85 'WARD' 'male' 86.25 Test3 83 'WARD' 'male' 86.25 Test4 90 'TORRES' 'male' 86.25 Test1 86 'TORRES' 'male' 86.25 Test2 85 'TORRES' 'male' 86.25 Test3 88 'TORRES' 'male' 86.25 Test4 86 'PETERSON' 'female' 76 Test1 75 'PETERSON' 'female' 76 Test2 80 'PETERSON' 'female' 76 Test3 72 'PETERSON' 'female' 76 Test4 77 'GRAY' 'female' 88 Test1 89 'GRAY' 'female' 88 Test2 86 'GRAY' 'female' 88 Test3 87 'GRAY' 'female' 88 Test4 90 'RAMIREZ' 'female' 95.25 Test1 96 'RAMIREZ' 'female' 95.25 Test2 92 'RAMIREZ' 'female' 95.25 Test3 98 'RAMIREZ' 'female' 95.25 Test4 95 'JAMES' 'male' 76.75 Test1 78 'JAMES' 'male' 76.75 Test2 75 'JAMES' 'male' 76.75 Test3 77 'JAMES' 'male' 76.75 Test4 77 'WATSON' 'female' 91.75 Test1 91 'WATSON' 'female' 91.75 Test2 94 'WATSON' 'female' 91.75 Test3 92 'WATSON' 'female' 91.75 Test4 90 'BROOKS' 'female' 85.75 Test1 86 'BROOKS' 'female' 85.75 Test2 83 'BROOKS' 'female' 85.75 Test3 85 'BROOKS' 'female' 85.75 Test4 89 'KELLY' 'male' 79.25 Test1 79 'KELLY' 'male' 79.25 Test2 76 'KELLY' 'male' 79.25 Test3 82 'KELLY' 'male' 79.25 Test4 80The original test variable names, Test1, Test2, Test3, and Test4, appear as levels in the combined test scores indicator variable, Scores_Indicator.
Plot data grouped by category.
With the data in this format, you can use Scores_Indicator as
a grouping variable, and draw box plots of test scores grouped by
test.figure() boxplot(dsTall.Scores,dsTall.Scores_Indicator)
Reformat the dataset array into wide format.
Reformat dsTall back into its original wide
format.dsWide = unstack(dsTall,'Scores','Scores_Indicator'); dsWide(:,{'LastName','Test1','Test2','Test3','Test4'})
ans = LastName Test1 Test2 Test3 Test4 'HOWARD' 90 87 93 92 'WARD' 87 85 83 90 'TORRES' 86 85 88 86 'PETERSON' 75 80 72 77 'GRAY' 89 86 87 90 'RAMIREZ' 96 92 98 95 'JAMES' 78 75 77 77 'WATSON' 91 94 92 90 'BROOKS' 86 83 85 89 'KELLY' 79 76 82 80The dataset array is back in wide format. unstack reassigns the levels of the indicator variable, Scores_Indicator, as variable names in the unstacked dataset array.
Load sample data.
Navigate to the folder containing sample data. Import the data
from the spreadsheet messy.xlsx. cd(matlabroot) cd('help/toolbox/stats/examples') messyData = dataset('XLSFile','messy.xlsx')
messyData = var1 var2 var3 var4 var5 'afe1' '3' 'yes' '3' 3 'egh3' '.' 'no' '7' 7 'wth4' '3' 'yes' '3' 3 'atn2' '23' 'no' '23' 23 'arg1' '5' 'yes' '5' 5 'jre3' '34.6' 'yes' '34.6' 34.6 'wen9' '234' 'yes' '234' 234 'ple2' '2' 'no' '2' 2 'dbo8' '5' 'no' '5' 5 'oii4' '5' 'yes' '5' 5 'wnk3' '245' 'yes' '245' 245 'abk6' '563' '' '563' 563 'pnj5' '463' 'no' '463' 463 'wnn3' '6' 'no' '6' 6 'oks9' '23' 'yes' '23' 23 'wba3' '' 'yes' 'NaN' 14 'pkn4' '2' 'no' '2' 2 'adw3' '22' 'no' '22' 22 'poj2' '-99' 'yes' '-99' -99 'bas8' '23' 'no' '23' 23 'gry5' 'NA' 'yes' 'NaN' 21When you import data from a spreadsheet, dataset reads any variables with nonnumeric elements as a cell array of strings. This is why the variable var2 is a cell array of strings. When importing data from a text file, you have more flexibility to specify which nonnumeric expressions to treat as missing using the option TreatAsEmpty.
There are many different missing data indicators in messy.xlsx, such as:
- Empty cells
- A period (.)
- NA
- NaN
- -99
Find observations with missing values.
Display the subset of observations that have at least one missing
value using ismissing. ix = ismissing(messyData,'NumericTreatAsMissing',-99,... 'StringTreatAsMissing',{'NaN','.','NA'}); messyData(any(ix,2),:)
ans = var1 var2 var3 var4 var5 'egh3' '.' 'no' '7' 7 'abk6' '563' '' '563' 563 'wba3' '' 'yes' 'NaN' 14 'poj2' '-99' 'yes' '-99' -99 'gry5' 'NA' 'yes' 'NaN' 21By default, ismissing recognizes the following missing value indicators:
- NaN for numeric arrays
- '' for string arrays
- <undefined> for categorical arrays
Convert string arrays to double arrays.
You can convert the string variables that should be numeric
using str2double.messyData.var2 = str2double(messyData.var2); messyData.var4 = str2double(messyData.var4)
messyData = var1 var2 var3 var4 var5 'afe1' 3 'yes' 3 3 'egh3' NaN 'no' 7 7 'wth4' 3 'yes' 3 3 'atn2' 23 'no' 23 23 'arg1' 5 'yes' 5 5 'jre3' 34.6 'yes' 34.6 34.6 'wen9' 234 'yes' 234 234 'ple2' 2 'no' 2 2 'dbo8' 5 'no' 5 5 'oii4' 5 'yes' 5 5 'wnk3' 245 'yes' 245 245 'abk6' 563 '' 563 563 'pnj5' 463 'no' 463 463 'wnn3' 6 'no' 6 6 'oks9' 23 'yes' 23 23 'wba3' NaN 'yes' NaN 14 'pkn4' 2 'no' 2 2 'adw3' 22 'no' 22 22 'poj2' -99 'yes' -99 -99 'bas8' 23 'no' 23 23 'gry5' NaN 'yes' NaN 21Now, var2 and var4 are numeric arrays. During the conversion, str2double replaces the nonnumeric elements of the variables var2 and var4 with the value NaN. However, there are no changes to the numeric missing value indicator, -99.
When applying the same function to many dataset array variables, it can sometimes be more convenient to use datasetfun. For example, to convert both var2 and var4 to numeric arrays simultaneously, you can use:
messyData(:,[2,4]) = datasetfun(@str2double,messyData,'DataVars',[2,4],... 'DatasetOutput',true);
Replace missing value indicators.
Clean the data so that the missing values indicated by the code -99 have
the standard MATLAB® numeric missing value indicator, NaN.messyData = replaceWithMissing(messyData,'NumericValues',-99)
messyData = var1 var2 var3 var4 var5 'afe1' 3 'yes' 3 3 'egh3' NaN 'no' 7 7 'wth4' 3 'yes' 3 3 'atn2' 23 'no' 23 23 'arg1' 5 'yes' 5 5 'jre3' 34.6 'yes' 34.6 34.6 'wen9' 234 'yes' 234 234 'ple2' 2 'no' 2 2 'dbo8' 5 'no' 5 5 'oii4' 5 'yes' 5 5 'wnk3' 245 'yes' 245 245 'abk6' 563 '' 563 563 'pnj5' 463 'no' 463 463 'wnn3' 6 'no' 6 6 'oks9' 23 'yes' 23 23 'wba3' NaN 'yes' NaN 14 'pkn4' 2 'no' 2 2 'adw3' 22 'no' 22 22 'poj2' NaN 'yes' NaN NaN 'bas8' 23 'no' 23 23 'gry5' NaN 'yes' NaN 21
Create a dataset array with complete observations.
Create a new dataset array that contains only the complete observations—those
without missing data.ix = ismissing(messyData); completeData = messyData(~any(ix,2),:)
completeData = var1 var2 var3 var4 var5 'afe1' 3 'yes' 3 3 'wth4' 3 'yes' 3 3 'atn2' 23 'no' 23 23 'arg1' 5 'yes' 5 5 'jre3' 34.6 'yes' 34.6 34.6 'wen9' 234 'yes' 234 234 'ple2' 2 'no' 2 2 'dbo8' 5 'no' 5 5 'oii4' 5 'yes' 5 5 'wnk3' 245 'yes' 245 245 'pnj5' 463 'no' 463 463 'wnn3' 6 'no' 6 6 'oks9' 23 'yes' 23 23 'pkn4' 2 'no' 2 2 'adw3' 22 'no' 22 22 'bas8' 23 'no' 23 23
Load sample data.
Navigate to the folder containing sample data. Import the data
from the comma-separated text file testScores.csv.cd(matlabroot) cd('help/toolbox/stats/examples') ds = dataset('File','testScores.csv','Delimiter',',')ds = LastName Sex Test1 Test2 Test3 Test4 'HOWARD' 'male' 90 87 93 92 'WARD' 'male' 87 85 83 90 'TORRES' 'male' 86 85 88 86 'PETERSON' 'female' 75 80 72 77 'GRAY' 'female' 89 86 87 90 'RAMIREZ' 'female' 96 92 98 95 'JAMES' 'male' 78 75 77 77 'WATSON' 'female' 91 94 92 90 'BROOKS' 'female' 86 83 85 89 'KELLY' 'male' 79 76 82 80 There are 4 test scores for each of 10 students, in wide format.
Average dataset array variables.
Compute the average (mean) test score for each student in the
dataset array, and store it in a new variable, TestAvg.
Test scores are in columns 3 to 6.Use double to convert the specified dataset array variables into a numeric array. Then, calculate the mean across the second dimension (across columns) to get the test average for each student.
ds.TestAvg = mean(double(ds(:,3:6)),2); ds(:,{'LastName','TestAvg'})ans = LastName TestAvg 'HOWARD' 90.5 'WARD' 86.25 'TORRES' 86.25 'PETERSON' 76 'GRAY' 88 'RAMIREZ' 95.25 'JAMES' 76.75 'WATSON' 91.75 'BROOKS' 85.75 'KELLY' 79.25
Summarize the dataset array using a grouping variable.
Compute the mean and maximum average test scores for each gender.stats = grpstats(ds,'Sex',{'mean','max'},'DataVars','TestAvg')stats = Sex GroupCount mean_TestAvg max_TestAvg male 'male' 5 83.8 90.5 female 'female' 5 87.35 95.25 This returns a new dataset array containing the specified summary statistics for each level of the grouping variable, Sex.
Replace data values.
The denominator for each test score is 100. Convert the test
score denominator to 25.scores = double(ds(:,3:6)); newScores = scores*25/100; ds = replacedata(ds,newScores,3:6)ds = LastName Sex Test1 Test2 Test3 Test4 TestAvg 'HOWARD' 'male' 22.5 21.75 23.25 23 90.5 'WARD' 'male' 21.75 21.25 20.75 22.5 86.25 'TORRES' 'male' 21.5 21.25 22 21.5 86.25 'PETERSON' 'female' 18.75 20 18 19.25 76 'GRAY' 'female' 22.25 21.5 21.75 22.5 88 'RAMIREZ' 'female' 24 23 24.5 23.75 95.25 'JAMES' 'male' 19.5 18.75 19.25 19.25 76.75 'WATSON' 'female' 22.75 23.5 23 22.5 91.75 'BROOKS' 'female' 21.5 20.75 21.25 22.25 85.75 'KELLY' 'male' 19.75 19 20.5 20 79.25 The first two lines of code extract the test data and perform the desired calculation. Then, replacedata inserts the new test scores back into the dataset array.
The variable of test score averages, TestAvg, is now the final score for each student.
Change variable name.
Change the variable name to Final.ds.Properties.VarNames{end} = 'Final'; dsds = LastName Sex Test1 Test2 Test3 Test4 Final 'HOWARD' 'male' 22.5 21.75 23.25 23 90.5 'WARD' 'male' 21.75 21.25 20.75 22.5 86.25 'TORRES' 'male' 21.5 21.25 22 21.5 86.25 'PETERSON' 'female' 18.75 20 18 19.25 76 'GRAY' 'female' 22.25 21.5 21.75 22.5 88 'RAMIREZ' 'female' 24 23 24.5 23.75 95.25 'JAMES' 'male' 19.5 18.75 19.25 19.25 76.75 'WATSON' 'female' 22.75 23.5 23 22.5 91.75 'BROOKS' 'female' 21.5 20.75 21.25 22.25 85.75 'KELLY' 'male' 19.75 19 20.5 20 79.25
Load sample data.
load('hospital')The dataset array has 100 observations and 7 variables.
Export to a text file.
Export the dataset array, hospital, to a
text file named hospital.txt. By default, export writes
to a tab-delimited text file with the same name as the dataset array,
appended by .txt.export(hospital)This creates the file hospital.txt in the current working folder, if it does not previously exist. If the file already exists in the current working folder, export overwrites the existing file.
By default, variable names are in the first line of the text file. Observation names, if present, are in the first column.
Export without variable names.
Export hospital with variable names suppressed
to a text file named NoLabels.txt.export(hospital,'File','NoLabels.txt','WriteVarNames',false)There are no variable names in the first line of the created text file, NoLabels.txt.
Export to a comma-delimited format.
Export hospital to a comma-delimited text
file, hospital.csv.export(hospital,'File','hospital.csv','Delimiter',',')
Export to an Excel spreadsheet.
Export hospital to an Excel® spreadsheet
named hospital.xlsx.export(hospital,'XLSFile','hospital.xlsx')By default, the first row of hospital.xlsx has variable names, and the first column has observation names.
Open Dataset Arrays in the Variables Editor
The MATLAB Variables editor provides a convenient interface for viewing, modifying, and plotting dataset arrays.First, load the sample data set, hospital.
load hospital
The dataset
array, hospital, is created in the MATLAB workspace.The dataset array has 100 observations and 7 variables.
To open hospital in the Variables editor, click Open Variable, and select hospital.
The Variables editor opens, displaying the contents of the dataset array (only the first 10 observations are shown here).
In the Variables editor, you can see the names of the seven variables along the top row, and the observations names down the first column.
Modify Variable and Observation Names
You can modify variable and observation names by double-clicking a name, and then typing new text.All changes made in the Variables editor are also sent to the command line.
The sixth variable in the data set, BloodPressure, is a numeric array with two columns. The first column shows systolic blood pressure, and the second column shows diastolic blood pressure. Click the arrow that appears on the right side of the variable name cell to see the units and description of the variable. You can type directly in the units and description fields to modify the text. The variable data type and size are shown under the variable description.
Reorder or Delete Variables
You can reorder variables in a dataset array using the Variables editor. Hover over the left side of a variable name cell until a four-headed arrow appears.After the arrow appears, click and drag the variable column to a new location.
The command for the variable reordering appears in the command line.
You can delete a variable in the Variables editor by selecting the variable column, right-clicking, and selecting Delete Column Variable(s).
The command for the variable deletion appears in the command line.
Add New Data
You can enter new data values directly into the Variables editor. For example, you can add a new patient observation to the hospital data set. To enter a new last name, add a string to the end of the variable LastName.The variable Gender is a nominal array. The levels of the categorical variable appear in a drop-down list when you double-click a cell in the Gender column. You can choose one of the levels previously used, or create a new level by selecting New Item.
You can continue to add data for the remaining variables.
To change the observation name, click the observation name and type the new name.
The commands for entering the new data appear at the command line.
Notice the warning that appears after the first assignment. When you enter the first piece of data in the new observation row—here, the last name—default values are assigned to all other variables. Default assignments are:
- 0 for numeric variables
- <undefined> for categorical variables
- [] for cell arrays
Sort Observations
You can use the Variables editor to sort dataset array observations by the values of one or more variables. To sort by gender, for example, select the variable Gender. Then click Sort, and choose to sort rows by ascending or descending values of the selected variable.When sorting by variables that are cell arrays of strings or of nominal data type, observations are sorted alphabetically. For ordinal variables, rows are sorted by the ordering of the levels. For example, when the observations of hospital are sorted by the values in Gender, the females are grouped together, followed by the males.
To sort by the values of multiple variables, press Ctrl while you select multiple variables.
When you use the Variables editor to sort rows, it is the same as calling sortrows. You can see this at the command line after executing the sorting.
Select a Subset of Data
You can select a subset of data from a dataset array in the Variables editor, and create a new dataset array from the selection. For example, to create a dataset array containing only the variables LastName and Age:- Hold Ctrl while you click the variables LastName and Age.
- Right-click, and select New
Workspace Variable from Selection > New Dataset
Array.
You can use the same steps to select any subset of data. To select observations according to some logical condition, you can use a combination of sorting and selecting. For example, to create a new dataset array containing only males aged 45 and older:
- Sort the observations of hospital by the values in Gender and Age, descending.
- Select the male observations with age 45 and older.
- Right-click, and select New
Workspace Variables from Selection > New
Dataset Array. The new dataset array, hospital2,
is created in the Workspace window.
- You can rename the dataset array in the Workspace
window.
Create Plots
You can plot data from a dataset array using plotting options in the Variables editor. Available plot choices depend on the data types of variables to be plotted.For example, if you select the variable Age, you can see in the Plots tab some plotting options that are appropriate for a univariate, numeric variable.
Sometimes, there are plot options for multiple variables, depending on their data types. For example, if you select both Age and Gender, you can draw box plots of age, grouped by gender.
Ways To Index and Search
There are many ways to index into dataset arrays. For example, for a dataset array, ds, you can:- Use () to create a new dataset array from a subset of ds. For example, ds1 = ds(1:5,:) creates a new dataset array, ds1, consisting of the first five rows of ds. Metadata, including variable and observation names, transfers to the new dataset array.
- Use variable names with dot notation to index individual variables in a dataset array. For example, ds.Height indexes the variable named Height.
- Use observation names to index individual observations in a dataset array. For example, ds('Obs1',:) gives data for the observation named Obs1.
- Use observation or variable numbers. For example, ds(:,[1,3,5]) gives the data in the first, third, and fifth variables (columns) of ds.
- Use logical indexing to search for observations in ds that satisfy a logical condition. For example, ds(ds.Gender=='Male',:) gives the observations in ds where the variable named Gender, a nominal array, has the value Male.
- Use ismissing to find missing data in the dataset array.
Examples
Common Indexing and Searching Methods
This example shows several indexing and searching methods for categorical arrays.Load the sample data.
load hospital;
size(hospital)
ans = 100 7The dataset array has 100 observations and 7 variables.
Index a variable by name. Return the minimum age in the dataset array.
min(hospital.Age)
ans = 25Delete the variable Trials.
hospital.Trials = []; size(hospital)
ans = 100 6Index an observation by name. Display measurements on the first five variables for the observation named PUE-347.
hospital('PUE-347',1:5)
ans = LastName Sex Age Weight Smoker PUE-347 'YOUNG' Female 25 114 falseIndex variables by number. Create a new dataset array containing the first four variables of hospital.
dsNew = hospital(:,1:4); dsNew.Properties.VarNames(:)
ans = 'LastName' 'Sex' 'Age' 'Weight'Index observations by number. Delete the last 10 observations.
hospital(end-9:end,:) = []; size(hospital)
ans = 90 6Search for observations by logical condition. Create a new dataset array containing only females who smoke.
dsFS = hospital(hospital.Sex=='Female' & hospital.Smoker==true,:); dsFS(:,{'LastName','Sex','Smoker'})
ans = LastName Sex Smoker LPD-746 'MILLER' Female true XBR-291 'GARCIA' Female true AAX-056 'LEE' Female true DTT-578 'WALKER' Female true AFK-336 'WRIGHT' Female true RBA-579 'SANCHEZ' Female true HAK-381 'MORRIS' Female true NSK-403 'RAMIREZ' Female true ILS-109 'WATSON' Female true JDR-456 'SANDERS' Female true HWZ-321 'PATTERSON' Female true GGU-691 'HUGHES' Female true WUS-105 'FLORES' Female true