(Organise and restructure your data, specify special structures, and filter out unwanted observations)
Filter ... (dataset)
• Levels of a categorical variable
• Numeric condition
• Row number
• Randomly
Delete current dataset
Merge/Join datasets:
• Join datasets (several types of join)
• Append rows
Survey design (for data from complex survey designs)
• About survey data
• Specify design ...
• Specify replicate design ...
• Post stratify ... (or rake or calibrate)
• Remove design ...
This tool provides several methods for filtering the dataset. The window that opens has four options for you to choose from:
After selecting a categorical variable from the drop-down box, you can select which levels you want to keep in the dataset.
This allows you to define a condition with which to filter your data. For example, you could keep only the observations with height over 180 cm by selecting height from the drop-down menu, choosing the > symbol, and entering 180 in the third box.
Exclude a range of row numbers by listing them in the text box. For example, entering 1, 5, 99, 101:1000 will exclude rows 1, 5, 99, and everything from 101 to 1000.
Essentially, this allows you to perform bootstrap randomisation manually. The current behaviour is this:
The observations are drawn randomly without replacement from the data set.
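For reference, here is a minimal base-R sketch of the four filtering methods, assuming a data frame named mydata with hypothetical variables gender and height:

```r
# Keep only selected levels of a categorical variable
kept <- subset(mydata, gender %in% c("female"))

# Keep rows meeting a numeric condition (height over 180 cm)
tall <- subset(mydata, height > 180)

# Exclude rows 1, 5, 99, and 101 to 1000 by row number
sub <- mydata[-c(1, 5, 99, 101:1000), ]

# Draw 100 rows at random, without replacement
rnd <- mydata[sample(nrow(mydata), 100), ]
```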
Sort the rows of the data by one or more variables
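A rough base-R equivalent, using the same hypothetical variables:

```r
# Sort by gender, breaking ties by height
sorted <- mydata[order(mydata$gender, mydata$height), ]
```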
This function essentially allows you to obtain "summaries" of all of the numeric variables in the data set for combinations of categorical variables.
For example, aggregating over gender = {male, female} and ethnicity = {white, black, asian, other} will result in a data set with 2 x 4 = 8 rows. If the data set contains gender (cat) and height (num), and the user selects Mean and Sd, then the new data set will have the columns gender, height.Mean, and height.Sd.
In the rows, the values will be for that combination of categorical variables; the row for gender = female will have the mean height of the females, and the standard deviation of height for the females.
A visual example of this would be to drag height into the Variable 1 slot, and gender into the Variable 2 slot.
Clicking on "Get Summary" would provide the same information. The advantage of using Aggregate is that the summaries are calculated for every numeric variable in the data set, not just one of them.
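One way to reproduce the gender/height example with the dplyr package (a sketch, assuming a data frame mydata):

```r
library(dplyr)

# Mean and standard deviation of height for each gender
agg <- mydata |>
  group_by(gender) |>
  summarise(height.Mean = mean(height),
            height.Sd   = sd(height))
```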
Convert from table form (rows corresponding to subjects) to long form (rows corresponding to observations).
In many cases, the data may be in tabular form, in which multiple observations are made but placed in different columns.
An example of this may be a study of blood pressure on patients using several medications. The columns of this data set may be:
patient.id, gender, drug, Week1, Week2, Week3. Here, each patient has their own row in the data set, but each row contains three observations of blood pressure.
patient.id | gender | drug | Week1 | Week2 | Week3 |
---|---|---|---|---|---|
1 | male | A | 130 | 125 | 120 |
2 | male | B | 140 | 130 | 110 |
3 | female | A | 120 | 119 | 116 |
We may want to convert to long form, where we have each observation in a new row, and use a categorical variable to differentiate the weeks.
In this case, we would select Week1, Week2, and Week3 as the variables in the list. The new data set will have the columns patient.id, gender, drug, stack.variable ("Week"), and stack.value ("blood pressure").
patient.id | gender | drug | stack.variable | stack.value |
---|---|---|---|---|
1 | male | A | Week1 | 130 |
1 | male | A | Week2 | 125 |
1 | male | A | Week3 | 120 |
2 | male | B | Week1 | 140 |
2 | male | B | Week2 | 130 |
2 | male | B | Week3 | 110 |
3 | female | A | Week1 | 120 |
3 | female | A | Week2 | 119 |
3 | female | A | Week3 | 116 |
Of course, you can rename the variables as appropriate using "Manipulate Variables" > "Rename Variables".
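Under the same assumptions (a data frame bp with the columns above), the equivalent reshape could be done with the tidyr package:

```r
library(tidyr)

# One row per (patient, week) observation
long <- pivot_longer(bp, cols = c(Week1, Week2, Week3),
                     names_to  = "stack.variable",
                     values_to = "stack.value")
```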
This menu offers three tools for modifying the structure of the dataset:
• Reshape dataset
• Separate column
• Unite columns
This allows you to select one or more columns from your dataset. A new column (named key in the example) is populated by the column name(s) of the selected column(s), and a second new column (named value in the example) will contain the column values from the selected columns.
You can select a column to spread out into multiple columns (the column is named key in the example). It will use the column values of the selected column as a set of names for new columns. You then select another column with corresponding values to be put into the new columns (the column is named value in the example).
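A sketch of the spread operation with tidyr, assuming a data frame df with hypothetical columns key and value:

```r
library(tidyr)

# Use the values of key as new column names, filled from value
wide <- pivot_wider(df, names_from = key, values_from = value)
```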
Allows you to separate a column into several columns using a user-defined separator. If no separator is found, the additional columns formed will contain NAs.
In the example to the right, we have asked to separate column A using an underscore ("_") as the separator. Because only column A is being separated, column B (and any other columns) is left unchanged in the resulting new dataset. The maximum number of fields in column A after separation is 3 ("A_0.7K_2K"), so column A in the original dataset is replaced by 3 columns with default column names (Col1, etc.).
Instead of forming more columns, this version of Separate keeps the same number of columns, with the same names, but writes more rows. Using the same data as in the example above, the entry "A_0.7K_2K" in column A of the original dataset results in 3 rows in column A of the new dataset.
Allows you to select multiple columns and “unite” them using a defined separator (defaults to “_”). The united column name will be the combination of the selected columns with a “.” in between.
In this example, column “v1999” and column “v2000” are united by “_”. The new column name is “v1999.v2000”.
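A minimal tidyr sketch of the Separate and Unite operations, assuming a data frame df with hypothetical columns A, v1999, and v2000:

```r
library(tidyr)

# Separate column A into up to 3 columns on "_";
# rows with fewer pieces are padded with NAs
df2 <- separate(df, A, into = c("Col1", "Col2", "Col3"),
                sep = "_", fill = "right")

# Row-wise version: one new row per piece instead of new columns
df3 <- separate_rows(df, A, sep = "_")

# Unite v1999 and v2000 with "_" into a column named v1999.v2000
df4 <- unite(df, "v1999.v2000", v1999, v2000, sep = "_")
```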
This "joins", or brings together, information in two data sets: the current dataset in iNZight and a newly imported dataset (read in using the Import data facility) shown at the mid-right.
Left Join: The most important joining method is called a Left Join, the main purpose of which is to add new variables to the original dataset by extracting the information from the new dataset.
Matching rows: (Not yet fully implemented in Lite.) The main problem is to identify what pieces of information belong together. The most straightforward case occurs where there is a variable in the original dataset which is a unique identifier. If that variable is also in the imported dataset (even if under a different name) we can use it to match up the data which belongs to the same unit/entity.
To partially automate the process, iNZight looks for variables with the same name in both datasets (originally x1, x2, and x3 in the Example to the right) and offers those for determining matches.
In the Example, we have rejected x3 using the delete button beside it and so have effectively told the program that it is units with the same values of both x1 and x2 that belong together.
Click the Join button at the bottom once you are happy with the way the data is being joined.
The details of how the data is treated depend on the type of Join; we will document these after describing the Example.
In the Example, x4 is a new variable so that has been added to the preview-dataset. A complication is x3 which is in both datasets but with different values for the "same" units. So the program has decided to make two variables, one for the x3 values from the original dataset and one for the x3 values from the new dataset.
Left Join: All rows of the original dataset are retained; rows with no match in the new dataset get NAs for the additional columns.
How other joins differ from the Left Join:
Inner Join: Only use rows corresponding to matches between the two datasets
[Right Join: iNZight does not have this. Just import the datasets in the reverse order and use a left join.]
The following are just used to filter data. No columns are added to the join from the new dataset.
Semi Join: Use only rows in the original which have a match in the new dataset.
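A sketch of these joins with the dplyr package, assuming data frames orig and new that share the matching variables x1 and x2:

```r
library(dplyr)

# Left join: keep every row of orig; unmatched rows get NAs.
# A variable like x3 that appears in both datasets but is not
# used for matching gets two suffixed copies in the result.
joined <- left_join(orig, new, by = c("x1", "x2"),
                    suffix = c(".orig", ".new"))

# Inner join: keep only rows with a match in both datasets
both <- inner_join(orig, new, by = c("x1", "x2"))

# Semi join: filter orig to rows that have a match in new;
# no columns are added
matched <- semi_join(orig, new, by = c("x1", "x2"))
```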
Adds to the bottom of the original dataset rows from a newly imported dataset (imported using the Import facility provided).
An error will be reported, and appending will not be possible, if two column names match but their types (e.g. numeric or categorical) do not.
Where data is periodically being added to a dataset, this facility can be used to keep track of when each row was added, thus facilitating analyses of the data as it was up until any particular time point.
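A rough equivalent in R, assuming data frames orig and extra whose shared column names have compatible types:

```r
library(dplyr)

# Append the rows of extra below orig; columns are matched by
# name, and an error is raised if matched types are incompatible
combined <- bind_rows(orig, extra)
```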
• About survey data
• Specify design ...
• Specify replicate design ...
• Post stratify ... (or rake or calibrate)
• Remove design ...
It is important that specialist methods be applied when analysing data obtained using complex survey designs. Failure to do so typically leads to biased estimates and incorrect standard errors, confidence intervals and p-values.
When a survey design has been defined, almost all relevant parts of iNZight will apply analysis and graphics methods appropriate for data obtained using this survey design, using functions from R's survey package.
Important difference for Get Summary: whereas, generally, iNZight's Get Summary provides summary information about the dataset itself, when a survey design is specified Get Summary provides estimates of population quantities, clearly labelled as such. (Raw summaries of survey data are often meaningless because of unequal probabilities of selection.)
Regular statistical software analyses data as if the data were collected using simple random sampling. Many surveys, however, are conducted using more complicated sampling methods. Not only is it often nearly impossible to implement simple random sampling, but more complex methods are also more efficient, both financially and statistically. These methods use some or all of stratified sampling, cluster sampling and unequal sampling rates, whereby some parts of a population are sampled more heavily (i.e. with higher probabilities of selection) than other parts. These sampling features have to be allowed for in the analysis. While sometimes it may be possible to get reasonably accurate results using non-survey software, there is no practical way to know beforehand how far wrong the results from non-survey software will be.
Survey designs are typically specified to analysis programs either by specifying a survey design in terms of weighting, strata and clustering variables etc. in the dataset, or by selecting a set of variables in the dataset containing so-called replicate weights. Post-stratification/raking/calibration facilitates using additional population information (external to the survey data) to improve survey estimates. This is done after a survey design has been specified (by either method).
Specifying a survey design in terms of weighting, strata and clustering variables etc.
iNZight’s survey methods cater for simple random sampling, stratified random sampling (random sampling within each population stratum), cluster sampling and multistage cluster sampling, and complex designs involving cluster sampling within population strata.
For more information on quantities referred to in the dialog box see the documentation of the svydesign function in R's survey package.
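For illustration, a design with clustering, stratification and sampling weights might be specified directly in R as follows (the variable names psu, stratum, wt, and height are hypothetical):

```r
library(survey)

des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                 data = mydata, nest = TRUE)
svymean(~height, des)  # design-based estimate of a population mean
```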
Because releasing information such as cluster membership can make it easier to identify individuals than survey agencies are comfortable with, many agencies do not distribute such information to outsiders. Instead they distribute sets of so-called replicate weights: slightly varying copies of the sampling-weights variable that still enable survey analysis programs to make the proper adjustments to analyses of survey data.
For more information on the quantities referred to above, see the documentation of the svrepdesign function in R's survey package.
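A sketch of specifying a replicate-weights design, assuming hypothetical replicate-weight columns repw1, repw2, ... and a sampling-weights variable wt:

```r
library(survey)

# repweights can be given as a regular expression matching
# the names of the replicate-weight columns
rdes <- svrepdesign(weights = ~wt, repweights = "repw[0-9]+",
                    type = "JK1", data = mydata)
```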
This allows for poststratifying/raking/calibrating a design that has already been defined using either of the methods above. (Technically, the design is updated using the calibrate function in R's survey package.)
This allows a data-analyst to improve estimation by augmenting the information in the survey data by adding information on the whole population where this is available from other sources. Categorical variables in the data set are offered as possible poststratification/raking/calibration candidates. Corresponding population counts can be input by typing or reading from files.
In the example in the screenshot, there are 3 categorical variables in the data set offered as possible candidates and 2 have been selected. The design would then be calibrated using user-supplied information on population counts for all the categories of both admin_type and ethgp.
This is fine when you only want to use information about single variables, but what if you have population information on the cross-classification (all possible combinations) of admin_type and ethgp, say? Then you would have to create a new variable in the dataset, called say admin_type.ethgp that has all these combinations. This can be done with Variables > Categorical Variables > Combine categorical variables.
Warning: Currently the new variables have to be set up before specifying a survey design. If you only think about it later you will need to use Remove design (next item), set up new variables and then re-specify the design.
For more information on the quantities referred to in the dialog box, see the documentation of the calibrate function in R's survey package.
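For instance, post-stratifying an existing design des by admin_type might look like this (the population counts and category names are purely illustrative):

```r
library(survey)

# Hypothetical population counts for each level of admin_type
pop <- data.frame(admin_type = c("acute", "elective"),
                  Freq = c(52000, 38000))
pdes <- postStratify(des, strata = ~admin_type, population = pop)
```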
Discard design information and revert to using standard methods of analysis.
When a frequency column is specified, graphics and summaries respond to a combinations-of-variables-plus-their-frequencies data structure. Only categorical variables are retained.
The expand table option can be used to work with both numeric and categorical variables. The storage and efficiency savings of working with unique combinations of variables plus frequencies will typically be much smaller here.
Often, we want to validate that the data in a dataset adheres to our expectations about how that data should behave, based on external real-world knowledge.
For example, we expect that human heights are not negative, or that someone cannot work more than 24 hours in a single day.
The validation window gives us the ability to define or import rules, and to check whether the data conforms to those rules, identifying any exceptions.
Using the first example, the rule to check whether heights are above 0 can be written in the text box as height > 0 (given there is a variable named height in our dataset).
The results of each rule are presented in a table at the bottom of the window and show the number of observations that were checked ("Total"), the number of passes and failures ("Passes" and "Fails" respectively), and the fail percentage ("Fails (%)"). Initially this table is sorted by failure percentage, but you can click on other column headers to order the list in other ways.
Selecting a unique-identifier variable is more useful than employing row numbers (the default setting), because unique-identifier values remain unchanged in the data when rows are deleted, whereas row numbers will change.
This breakdown will provide details about the observations which failed on that particular rule, giving the row numbers (or unique-identifier values if a unique-identifier has been selected) of these observations and the values used to assess the rule.
Validation rules or changes to imported rule files are discarded once the validation window is closed, unless they are saved first. Saved rules are written to a text file on your computer that can be imported again using the "Open Rules" button, or viewed using a text editor.
The rules you use to validate the dataset do not need to be simple comparisons between a variable and a static value as in the previous example. More complex rules can be built by performing calculations on the variables; e.g. weight / height^2 < 50 will verify that each observation's body mass index is below 50. The values of each variable contained in the calculation, as well as the end result, are provided in the detailed breakdown.
For example, to check that the income of an individual (contained in variable Income) is no more than 1000 times their number of hours per week (contained in variable Hours), we can use the following rule: Income <= Hours * 1000. This will calculate a different value to compare income against for each observation.
For more information on what is possible using validation rules, the vignettes and help files of the underlying R package (validate) might be useful: Introduction to Validate vignette and Validate package (in particular, the syntax section of the reference manual).
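A minimal sketch of the same checks using the validate package directly in R, assuming a data frame mydata with variables height, weight, Income, and Hours:

```r
library(validate)

rules <- validator(height > 0,
                   weight / height^2 < 50,
                   Income <= Hours * 1000)
results <- confront(mydata, rules)
summary(results)  # Total / Passes / Fails for each rule
```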
Restores the data set to the way it was when it was initially imported.