Stata

This article is part of the Stata for Students series. If you are new to Stata we strongly recommend reading all the articles in the Stata Basics section.

Stata tries very hard to make all its commands work the same way. Spending a little time learning the syntax itself will make it much easier to use commands later.

To carry out the examples in this section, you'll need to have created an SFS folder and downloaded the gss_sample data set as described in Managing Stata Files. Create a new do file in that folder called syntax.do, as described in Doing Your Work Using Do Files. To start with it should contain:

capture log close
log using syntax.log, replace
clear all
set more off
use gss_sample
// work will go here
log close

Stata is a general-purpose statistical software package with data management, statistical analysis, graphics, simulation, regression, and custom programming capabilities. The aim of this document is to introduce Stata and the basics needed for data management and analysis.

The example commands will go after use gss_sample and before log close. Add the example commands to this do file as you go, and run it frequently to see the results.

Commands

Most Stata commands are verbs. They tell Stata to do something: summarize, tabulate, regress, etc. Normally the command itself comes first and then you tell Stata the details of what you want it to do after.

Many commands can be abbreviated: sum instead of summarize, tab instead of tabulate, reg instead of regress. Commands that can destroy data, like replace, cannot be abbreviated.
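
As a quick illustration of abbreviation, the three commands below should all produce the same table. This is a minimal sketch that assumes gss_sample is already loaded, as in syntax.do:

```stata
* Assumes gss_sample is in memory (see syntax.do above)
summarize age   // full command name
sum age         // common abbreviation
su age          // shortest abbreviation Stata accepts for summarize
```

The help file for each command underlines the shortest abbreviation it accepts, so you can check with help summarize.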

Variable Lists

A list of variables after a command tells the command which variables to act on. First try sum (summarize) all by itself, and then followed by age:

sum
sum age

If you don't specify which variables sum should act on it will give you summary statistics for all the variables in the data set. In this case that's a pretty long list. Putting age after sum tells it to only give you summary statistics for the age variable.

If you list more than one variable, the command will act on all of them:

sum age yearsjob prestg10

This gives you summary statistics for age, years on the job, and a rating of the respondent's job's prestige.

If Conditions

An if condition tells a command which observations it should act on. It will only act on those observations where the condition is true. This allows you to do things with subsets of the data. An if condition comes after a variable list:

sum yearsjob if sex==1

This gives you summary statistics for years on the job for just the male respondents (in the GSS 1 is male and 2 is female).

Note the two equals signs! In Stata you use one equals sign when you're setting something equal to something else (see Creating Variables) and two equals signs when you're asking if two things are equal. Other operators you can use are:

==    Equal
>     Greater than
<     Less than
>=    Greater than or equal to
<=    Less than or equal to
!=    Not equal

! all by itself means 'not' and reverses whatever condition follows it.
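
To see ! at work, the following sketch (again assuming gss_sample is loaded) writes the same condition two ways. Both commands should report on the same set of observations:

```stata
* Assumes gss_sample is in memory
sum yearsjob if !(sex==1)   // 'not (sex equals 1)'
sum yearsjob if sex!=1      // same group, written with !=
```

The parentheses in the first command make it clear that ! applies to the whole comparison.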

Combining Conditions

You can combine conditions with & (logical and) or | (logical or). The character used for logical or is called the 'pipe' character and you type it by pressing Shift-Backslash, the key right above Enter. Try:

sum yearsjob if sex==1 & income>=9
sum yearsjob if sex==1 | income>=9

The first gives you summary statistics for years on the job for respondents who are male and have a household income of $10,000 or more. The second gives you summary statistics for years on the job for respondents who are male or have a household income of $10,000 or more, a very different group.

Any conditions you combine must be complete. If you want summary statistics for years on the job for respondents who are either black (race==2) or 'other' (race==3) you cannot use:

sum yearsjob if race==2 | 3 // don't do this

(What this does and why is left as an exercise for the reader, but it's not what you want.) Instead you should use:

sum yearsjob if race==2 | race==3 // do this instead
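
Two related techniques may help once conditions get longer. The sketch below assumes gss_sample is loaded. It uses inlist(), a built-in Stata function this article does not otherwise cover, and it adds parentheses because Stata evaluates & before |:

```stata
* Assumes gss_sample is in memory
* inlist() is true if the first argument equals any of the others
sum yearsjob if inlist(race, 2, 3)               // same as race==2 | race==3

* Without parentheses this would be read as race==2 | (race==3 & income>=9)
sum yearsjob if (race==2 | race==3) & income>=9
```

When in doubt about the order of evaluation, adding parentheses costs nothing and makes your intent obvious.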

Missing Values

If you have missing values in your data, you need to keep them in mind when writing if conditions. Recall that the generic missing value (.) acts like positive infinity, and the extended missing values (.a, .b, etc.) are even bigger. So if you type:

sum yearsjob if age>65

you are not just getting summary statistics for years on the job for respondents who are older than 65. Anyone with a missing value for age is also included. Assuming you're interested in people who are known to be older than 65, you should exclude the people with missing values for age with a second condition:

sum yearsjob if age>65 & age<.

It makes a difference!

Why age<. rather than age!=.? For the age variable, the GSS uses .c for missing and age!=. would not exclude .c. Other variables use different extended missing values, and some use more than one. Using age<. guarantees you're excluding all missing values, even if you don't know ahead of time which ones the data set uses.
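
One way to check how much difference missing values make is to count the observations each condition selects. This is a sketch that assumes gss_sample is loaded. It uses the count command and the missing() function, neither of which is introduced above, although both are standard Stata:

```stata
* Assumes gss_sample is in memory
count if age>65                   // includes observations with missing age
count if age>65 & age<.           // only respondents known to be older than 65
count if missing(age)             // observations with any kind of missing age
count if age>65 & !missing(age)   // equivalent to the age<. version
```

The first count should equal the second plus the third, since every missing value is greater than 65 as far as Stata is concerned.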

Binary Variables

If you have a binary variable coded as 0 or 1, you can take advantage of the fact that to Stata 1 is true and 0 is false. Imagine that instead of a variable called sex coded 1/2, you had a variable called female coded 0/1. Then you could do things like:

sum yearsjob if female
sum yearsjob if !female // meaning 'not female'

Just one thing to be careful of: to Stata everything except 0 is true, including missing. If female had missing values you would need to use:

sum yearsjob if female & female<. // exclude missing values

or:

sum yearsjob if female==1 // automatically excludes missing values

Unfortunately the GSS does not code its binary variables 0/1 so you can't actually run these four commands. But many data sets do, and if you have to create your own binary variables you can make them easy to use by coding them 0/1.
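
If you want to try this with the GSS data anyway, you could build the 0/1 variable yourself. The following is only a sketch: female is a hypothetical variable name, and it assumes gss_sample is loaded with sex coded 1 for male and 2 for female as described above. Creating variables is covered properly in Creating Variables.

```stata
* Assumes gss_sample is in memory; female is a hypothetical new variable
gen female = (sex==2) if sex<.   // 1 = female, 0 = male, missing if sex is missing
sum yearsjob if female==1        // women only
sum yearsjob if female==0        // men only
```

The if sex<. at the end of the gen command keeps observations with a missing sex from being coded as 0.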

Options

Options change how a command works. They go after any variable list or if condition, following a comma. The comma means 'everything after this is options' so you only type one comma no matter how many options you're using.

The detail option tells summarize to calculate percentiles (including the 50th percentile, or median) and some additional moments.

sum yearsjob, detail

Many options can be abbreviated, just as commands can. In this case just d would do.

Some options require additional information, like the name of a variable or a number. Any additional information an option needs goes in parentheses directly after the option itself.

Recall that when we did sum all by itself and it gave us summary statistics for all the variables, it put a separator line after every five variables. You can change that with the separator (or just sep) option:

sum, sep(10)

The (10) in parentheses tells the separator option to put a separator between every ten variables. You'll learn more useful options that need additional information in the articles on statistical commands.
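
Variable lists, if conditions, and options can all appear in the same command, in that order. A short sketch, assuming gss_sample is loaded:

```stata
* Assumes gss_sample is in memory
* Order is always: command, variable list, if condition, comma, options
sum yearsjob if sex==1, detail
sum age yearsjob prestg10 if age<., sep(2)
```

Note that the comma comes after the if condition, not after the variable list.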

By

By allows you to execute a command separately for subgroups within your data. Try:

bysort sex: sum yearsjob

This gives you summary statistics for years on the job for both males and females, calculated separately.

By is a prefix, so it comes before the command itself. It's followed by the variable (or variables) that identifies the subgroups of interest, then a colon. The data must be sorted for by to work, so bysort is a shortcut that first sorts the data and then executes the by command. Now that the data set is sorted by sex, you can just use by in subsequent commands:

by sex: sum prestg10
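
By can also take more than one variable, and it combines with if conditions and options in the usual way. The following sketch assumes gss_sample is loaded and uses only variables that appear earlier in this article:

```stata
* Assumes gss_sample is in memory
bysort sex race: sum yearsjob        // one table per combination of sex and race
by sex race: sum prestg10 if age<.   // data are already sorted by sex and race
bysort sex: sum yearsjob, detail     // the by prefix works with options too
```

If you use plain by on data that are not sorted by those variables, Stata stops with a "not sorted" error, which is why bysort is the safer habit.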

Complete Do File

The following is a do file containing all the example commands in this section:

capture log close
log using syntax.log, replace
clear all
set more off
use gss_sample
sum
sum age
sum age yearsjob prestg10
sum yearsjob if sex==1
sum yearsjob if sex==1 & income>=9
sum yearsjob if sex==1 | income>=9
sum yearsjob if race==2 | 3 // don't do this
sum yearsjob if race==2 | race==3 // do this instead
sum yearsjob if age>65
sum yearsjob if age>65 & age<. // exclude missing values
/* Things you could do if you had female coded 0/1
instead of sex coded 1/2:
sum yearsjob if female
sum yearsjob if !female // meaning 'not female'
sum yearsjob if female & female<. // exclude missing values
sum yearsjob if female==1 // automatically excludes missing values
*/
sum yearsjob, detail
sum, sep(10)
bysort sex: sum yearsjob
by sex: sum prestg10
log close

Last Revised: 6/24/2016
