Title: | Data for the 'Mastering Health Data Science Using R' Online Textbook |
---|---|
Description: | Contains ten datasets used in the chapters and exercises of Paul, Alice (2023) "Health Data Science in R" <https://alicepaul.github.io/health-data-science-using-r/>. |
Authors: | Alice Paul [aut, cre], Hannah Eglinton [aut], Jialin Liu [ctb], Joanna Walsh [aut], Xinbei Yu [ctb] |
Maintainer: | Alice Paul <[email protected]> |
License: | CC BY 4.0 |
Version: | 0.1.3 |
Built: | 2025-03-04 06:03:12 UTC |
Source: | https://github.com/alicepaul/hdsinrdata |
30 features of cell nuclei present in digitized images of fine needle aspirates of 212 malignant and 357 benign breast masses.
breastcancer
A data frame with 569 rows and 32 variables. The first two variables are id and diagnosis, and then the mean, standard error, and "worst" or largest (mean of the three largest values) for each of ten features are reported as follows:
ID number
Diagnosis (M = malignant, B = benign)
Mean of mean distances from center to points on the perimeter
Mean of standard deviation of gray-scale values
Mean of perimeter
Mean of area
Mean of local variation in radius lengths
Mean of perimeter^2 / area - 1.0
Mean of severity of concave portions of the contour
Mean of number of concave portions of the contour
Mean of symmetry
Mean of "coastline approximation" - 1
Standard error of mean distances from center to points on the perimeter
Standard error of standard deviation of gray-scale values
Standard error of perimeter
Standard error of area
Standard error of local variation in radius lengths
Standard error of perimeter^2 / area - 1.0
Standard error of severity of concave portions of the contour
Standard error of number of concave portions of the contour
Standard error of symmetry
Standard error of "coastline approximation" - 1
"Worst" or largest (mean of the three largest values) of mean distances from center to points on the perimeter
"Worst" or largest (mean of the three largest values) of standard deviation of gray-scale values
"Worst" or largest (mean of the three largest values) of perimeter
"Worst" or largest (mean of the three largest values) of area
"Worst" or largest (mean of the three largest values) of local variation in radius lengths
"Worst" or largest (mean of the three largest values) of perimeter^2 / area - 1.0
"Worst" or largest (mean of the three largest values) of severity of concave portions of the contour
"Worst" or largest (mean of the three largest values) of number of concave portions of the contour
"Worst" or largest (mean of the three largest values) of symmetry
"Worst" or largest (mean of the three largest values) of "coastline approximation" - 1
All feature values are recoded with four significant digits.
Wolberg,William. (1992). Breast Cancer Wisconsin (Original). UCI Machine Learning Repository. https://doi.org/10.24432/C5HP4Z.
Obtained from the UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original
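The three summaries reported for each feature can be illustrated with a minimal sketch. The per-nucleus values below are made up for illustration, the object names are hypothetical, and the sketch assumes that "standard error" means the usual standard error of the mean.

```r
# Hypothetical per-nucleus radius measurements for one image (not real data).
radius <- c(10, 12, 14, 16, 18)

# Mean across nuclei.
radius_mean <- mean(radius)

# Standard error of the mean (assumed definition): sample SD / sqrt(n).
radius_se <- sd(radius) / sqrt(length(radius))

# "Worst": mean of the three largest values.
radius_worst <- mean(sort(radius, decreasing = TRUE)[1:3])

# Compactness as described above: perimeter^2 / area - 1.0.
compactness <- function(perimeter, area) perimeter^2 / area - 1.0
```

The same three summaries would apply to each of the ten features, which gives the 30 feature columns that follow id and diagnosis.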
Weekly confirmed Covid-19 cases and deaths at the state and county level in 2020, downloaded from the COVID19 R package.
covidcases
A data frame with 69,530 rows and 5 variables.
State (administrative_area_level_2 from Covid-19 Data Hub)
County (administrative_area_level_3 from Covid-19 Data Hub)
Week of 2020
Weekly Covid-19 cases calculated from the Covid-19 Data Hub's cumulative counts of confirmed cases. Note that, according to the Data Hub, "some of these values are negative due to decreasing cumulative counts in the original data provider".
Weekly Covid-19 deaths calculated from the Covid-19 Data Hub's cumulative counts of confirmed deaths. Again, note that "some of these values are negative due to decreasing cumulative counts in the original data provider".
Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi:10.21105/joss.02376
https://CRAN.R-project.org/package=COVID19
https://covid19datahub.io/index.html
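Weekly counts derived from cumulative counts are first differences, which is why a downward revision in the cumulative series produces a negative weekly value. A minimal sketch with made-up numbers follows; treating the count before the first week as zero is an assumption of this sketch, not something the package states.

```r
# Hypothetical cumulative confirmed cases for one county at the end of each week.
cumulative <- c(0, 5, 12, 11, 20)

# Weekly counts are first differences of the cumulative series
# (assumes a cumulative count of 0 before the first week).
weekly <- diff(c(0, cumulative))

# Weeks in which the cumulative count decreased show up as negative values.
negative_weeks <- which(weekly < 0)
```

Here the drop from 12 to 11 in the fourth week yields a weekly count of -1.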
Start and end dates of statewide stay at home orders in response to the Covid-19 pandemic.
lockdowndates
A data frame with 50 rows and 3 variables:
State
Start date of the statewide order in YYYY-MM-DD format
End date of the statewide order in YYYY-MM-DD format
Raifman, J., Nocka, K., Jones, D., Bor, J., Lipson, S., Jay, J., Cole, M., Krawczyk, N., Benfer, E. A., Chan, P., Galea, S. (2022). COVID-19 US State Policy Database. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2022-03-30. https://doi.org/10.3886/E119446V143
https://www.openicpsr.org/openicpsr/project/119446/version/V143/view
2020 mobility statistics at the state level from Descartes Labs.
mobility
A data frame with 9,333 rows and 5 variables:
State (originally admin1)
Date in YYYY-MM-DD format
The number of samples observed in the state on that date (summed across counties)
The median of the max-distance mobility (representing the distance a typical member of a given population moves in a day) for all samples in a county, averaged across counties.
The percent of normal m50 in the region, with normal m50 defined during 2020-02-17 to 2020-03-07, averaged across counties.
Note from the data website: "Data for 2020-04-20, 2020-05-29, 2020-10-08, 2020-12-11 through 2020-12-18, 2021-01-08 through 2021-01-14, 2021-04-07, 2021-04-12 and 2021-04-21 to present did not meet quality control standards, and was not released."
Data was obtained from Descartes Labs https://descarteslabs.com
Warren, Michael S. & Skillman, Samuel W. "Mobility Changes in Response to COVID-19". arXiv:2003.14228 [cs.SI], Mar. 2020. arxiv.org/abs/2003.14228
https://github.com/descarteslabs/DL-COVID-19
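To make the "percent of normal m50" idea concrete, here is a minimal sketch on made-up values. The column names are hypothetical, and using the median over the baseline window as "normal" is an assumption of this sketch; the text above only gives the window dates.

```r
# Hypothetical daily m50 values for one region (not real data).
mob <- data.frame(
  date = as.Date(c("2020-02-17", "2020-03-07", "2020-04-01")),
  m50  = c(10, 14, 3)
)

# Baseline window named in the variable description.
in_baseline <- mob$date >= as.Date("2020-02-17") &
  mob$date <= as.Date("2020-03-07")

# "Normal" m50: assumed here to be the median over the baseline window.
baseline <- median(mob$m50[in_baseline])

# Percent of normal m50 for every date.
mob$m50_index <- 100 * mob$m50 / baseline
```

With a baseline of 12, an m50 of 3 corresponds to 25 percent of normal mobility.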
Lead, blood pressure, and demographic variables from NHANES 1999-2018, downloaded from the nhanesA package. Data was filtered to adults 20 years of age or older with nonmissing blood lead level, blood pressure, and demographic information.
NHANESsample
A data frame with 31,265 rows and 21 variables:
Respondent sequence number ("SEQN" in NHANES)
Age ("RIDAGEYR" in NHANES: Best age in years of the sample person at time of HH screening. Individuals 85 and over are topcoded at 85 years of age up to 2006 and individuals 80 and over are topcoded at 80 years of age after 2006.)
Gender ("RIAGENDR" in NHANES)
Race and ethnicity ("RIDRETH1" in NHANES)
Education Level ("DMDEDUC2" in NHANES: What is the highest grade or level of school you have completed or the highest degree you have received?)
Poverty income ratio (PIR): a ratio of family income to poverty threshold ("INDFMPIR" in NHANES)
Smoking status (Combination of SMQ020 (Have you smoked at least 100 cigarettes in your entire life?) and SMQ040 (Do you now smoke cigarettes?) in NHANES: equal to "Still Smoke" if respondent answered "Yes" to SMQ020 and either "Every day" or "Some days" to SMQ040, equal to "Quit Smoke" if respondent answered "Yes" to SMQ020 and "Not at all" to SMQ040, and equal to "Never Smoke" otherwise.)
Year of the Study (Equal to the first year of the two year interval in which the response was recorded - NHANES surveys are grouped in two-year intervals)
Lead (ug/dL): "LBXBPB" in NHANES, unless the reported level of lead was less than the lower limit of detection (llod) for the relevant year, as defined by the paper cited below, in which case "LBXBPB" was replaced by llod/sqrt(2)
Body Mass Index Category (kg/m^2): Based on "BMXBMI" in NHANES
Quantile membership for blood lead levels based on the distribution of lead levels in the data
Hypertension Status: Based on "BPQ020" (Have you ever been told by a doctor or other health professional that you had hypertension, also called high blood pressure?) and "BPQ040A" (Because of your high blood pressure/hypertension, have you ever been told to take prescribed medicine?) in NHANES. Equal to 1 if the respondent answered "Yes" to either of these questions or, if either question was unanswered, if systolic blood pressure (SBP) >= 130 or diastolic blood pressure (DBP) >= 80; equal to 0 otherwise.
Alcohol Use: Based on "ALQ120Q" (In the past 12 months, how often did you drink any type of alcoholic beverage?) up to 2016 and "ALQ121" (the same question, but used after 2016) in NHANES. Equal to "Yes" if the respondent's answer to either of these questions was > 0 and equal to "No" otherwise.
First Diastolic Blood Pressure (mmHg) reading: "BPXDI1" in NHANES.
Second Diastolic Blood Pressure (mmHg) reading: "BPXDI2" in NHANES.
Third Diastolic Blood Pressure (mmHg) reading: "BPXDI3" in NHANES.
Fourth Diastolic Blood Pressure (mmHg) reading: "BPXDI4" in NHANES.
First Systolic Blood Pressure (mmHg) reading: "BPXSY1" in NHANES.
Second Systolic Blood Pressure (mmHg) reading: "BPXSY2" in NHANES.
Third Systolic Blood Pressure (mmHg) reading: "BPXSY3" in NHANES.
Fourth Systolic Blood Pressure (mmHg) reading: "BPXSY4" in NHANES.
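The smoking-status and blood-lead rules described above can be sketched as two small helpers. The function and argument names are hypothetical, and the inputs are assumed to hold the response labels as character strings rather than raw NHANES codes.

```r
# Smoking status from the SMQ020 and SMQ040 responses (labels assumed).
smoke_status <- function(smq020, smq040) {
  current <- smq020 %in% "Yes" & smq040 %in% c("Every day", "Some days")
  quit    <- smq020 %in% "Yes" & smq040 %in% "Not at all"
  ifelse(current, "Still Smoke", ifelse(quit, "Quit Smoke", "Never Smoke"))
}

# Replace lead values below the lower limit of detection with llod / sqrt(2).
replace_below_llod <- function(lead, llod) {
  ifelse(lead < llod, llod / sqrt(2), lead)
}

# Illustrative inputs only (not real NHANES records).
status <- smoke_status(c("Yes", "Yes", "No"), c("Some days", "Not at all", NA))
lead   <- replace_below_llod(c(0.1, 1.5), llod = 0.25)
```

Because llod varies by survey year, `llod` can also be passed as a vector aligned with `lead`.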
Data was obtained from the nhanesA package https://CRAN.R-project.org/package=nhanesA.
Variable selection and feature engineering were conducted in an effort to replicate the analyses conducted by
Huang, Z. (2022). Association Between Blood Lead Level With High Blood Pressure in US (NHANES 1999-2018). Frontiers in Public Health, 892.
https://www.frontiersin.org/articles/10.3389/fpubh.2022.836357/full.
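The hypertension rule used in this feature engineering can be written as a short function. This is a sketch under stated assumptions: the names are hypothetical, the questionnaire answers are assumed to be "Yes"/"No" strings with NA for unanswered, and `sbp` and `dbp` are assumed to be a single summary value per respondent (how the four readings are combined is not stated here).

```r
# Hypertension indicator following the rule described for this dataset:
# 1 if "Yes" to either question; otherwise, if either question is unanswered,
# 1 when SBP >= 130 or DBP >= 80; 0 in all remaining cases.
hypertension <- function(bpq020, bpq040a, sbp, dbp) {
  told_yes   <- bpq020 %in% "Yes" | bpq040a %in% "Yes"
  unanswered <- is.na(bpq020) | is.na(bpq040a)
  high_bp    <- (!is.na(sbp) & sbp >= 130) | (!is.na(dbp) & dbp >= 80)
  as.integer(told_yes | (unanswered & high_bp))
}

# Illustrative respondents only (not real NHANES records).
hyp <- hypertension(
  bpq020  = c("Yes", "No", NA, NA),
  bpq040a = c(NA, "No", NA, NA),
  sbp     = c(120, 150, 135, 110),
  dbp     = c(70, 95, 70, 70)
)
```

Note that the second respondent is coded 0 despite high readings, because both questions were answered "No" and the blood pressure fallback applies only when an answer is missing.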
Variables relating to demographic information, frequency of tobacco (e-cigs, cigarettes, and cigars) use, and methods of obtaining said tobacco as reported by students on the 2021 NYTS.
nyts
A data frame with 20,413 rows and 35 variables:
Survey Setting: Answer to the question "Where are you currently taking the survey?"
Age: Answer to QN1: "How old are you?"
Sex: Answer to QN2: "What is your sex?"
Grade: Answer to QN3: "What grade are you in?"
Race and Ethnicity: Equal to "Hispanic" if any of QN4B ("Are you Hispanic, Latino, Latina, or of Spanish origin?" (Yes, Mexican, Mexican American, Chicano, or Chicana)), QN4C ("Are you Hispanic, Latino, Latina, or of Spanish origin?" (Yes, Puerto Rican)), QN4D ("Are you Hispanic, Latino, Latina, or of Spanish origin?" (Yes, Cuban)), or QN4E ("Are you Hispanic, Latino, Latina, or of Spanish origin?" (Yes, Another Hispanic, Latino, Latina, or Spanish origin)) are selected. Otherwise, equal to "non-Hispanic Black" if QN5C ("What race or races do you consider yourself to be?" (Black or African American)) is selected, equal to "non-Hispanic White" if QN5E ("What race or races do you consider yourself to be?" (White)) is selected, and equal to "non-Hispanic other race" if QN5A ("What race or races do you consider yourself to be?" (American Indian or Alaska Native)), QN5B ("What race or races do you consider yourself to be?" (Asian)), or QN5D ("What race or races do you consider yourself to be?" (Native Hawaiian or Other Pacific Islander)) is selected.
Speaks Language other than English at Home: Answer to QN154: "Do you speak a language other than English at home?"
Grades in the Past Year: Answer to QN165: "During the past 12 months, how would you describe your grades in school?"
LGBT Status: Equal to "Yes" if respondent answered QN155 ("Which of the following best describes you") with "Gay or Lesbian" or "Bisexual" or if respondent answered QN156 ("Some people describe themselves as transgender when their sex at birth does not match the way they think or feel about their gender. Are you transgender?") with "Yes, I am transgender". Equal to "Not Sure" if respondent answered QN155 with "Not Sure" or answered QN156 with "I am not sure if I am transgender". Equal to "No" if respondent answered QN155 with "Heterosexual (straight)" and answered QN156 with "No, I am not transgender".
Psychological Distress: As defined in the online supplement for the linked paper: "Psychological distress was assessed with the Patient Health Questionnaire for Depression and Anxiety (PHQ-4), a composite scale made up of four questions: “During the past two weeks, how often have you been bothered by any of the following problems?”: QN157A: Little interest or pleasure in doing things; QN157B: Feeling down, depressed, or hopeless; QN157C: Feeling nervous, anxious, or on edge; QN157D: Not being able to stop or control worrying. Response options were provided with a numeric value of 0 for “not at all,” 1 for “several days,” 2 for “more than half of the days,” and 3 for “nearly every day”. Responses were summed (range: 0 – 12) and categorized as none (0–2), mild (3–5), moderate (6–8) and severe (9–12)."
Family Affluence: As defined in the online supplement for the linked paper: "Family affluence was assessed with the Family Affluence Scale (FAS), a composite scale made up of four questions. Numeric values were assigned to each response and summed across responses: QN161: “Does your family own a vehicle (such as a car, van, or truck)? (No=0; Yes, one=1; Yes, two or more=2); QN162: “Do you have your own bedroom?” (No=0; Yes=1); QN163: “How many computers (including laptops and tablets, not including game consoles and smartphones) does your family own?” (None=0; One=1; Two=2; More than two=3); and QN164: “During the past 12 months, how many times did you travel on vacation with your family? (Not at all=0; Once=1; Twice=2; More than twice=3). Summed responses (range: 0–9) were categorized into low (0–5), medium (6–7), and high (8–9)."
Days of E-cig Use in the Past 30 days: Answer to QN9: "During the past 30 days, on how many days did you use e-cigarettes?". Equal to 0 if respondent answered QN6 ("Have you ever used an e-cigarette, even once or twice") with "No"
Days of Cigarette Use in the Past 30 days: Answer to QN38: "During the past 30 days, on how many days did you smoke cigarettes?". Equal to 0 if respondent answered QN35 ("Have you ever smoked a cigarette, even one or two puffs") with "No"
Days of Cigar Use in the Past 30 days: Answer to QN53: "During the past 30 days, on how many days did you smoke cigars, cigarillos, or little cigars?". Equal to 0 if respondent answered QN51 ("Have you ever smoked a cigar, cigarillo, or little cigar, even one or two puffs?") with "No"
Perceived Percentage of Students in Respondent's Grade who Smoke Cigarettes: Answer to QN125: "Out of every 10 students in your grade at school, how many do you think smoke cigarettes?" divided by 10
Perceived Percentage of Students in Respondent's Grade who Use e-cigarettes: Answer to QN126: "Out of every 10 students in your grade at school, how many do you think use e-cigarettes?" divided by 10
"I bought them myself during the past 30 days": Equal to 1 if respondent selected any of QN20AA, QN20BA, QN20CA (During the past 30 days, how did you get your _____? (I bought them myself) for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
"I had someone else buy them for me during the past 30 days": Equal to 1 if respondent selected any of QN20AB, QN20BB, QN20CB (During the past 30 days, how did you get your _____? (I had someone else buy them for me) for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
"I asked someone to give me some during the past 30 days": Equal to 1 if respondent selected any of QN20AC, QN20BC, QN20CC (During the past 30 days, how did you get your _____? (I asked someone to give me some) for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
"Someone offered them to me during the past 30 days": Equal to 1 if respondent selected any of QN20AD, QN20BD, QN20CD (During the past 30 days, how did you get your _____? (Someone offered them to me) for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
"I got them from a friend during the past 30 days": Equal to 1 if respondent selected any of QN20AE, QN20BE, QN20CE (During the past 30 days, how did you get your _____? (I got them from a friend) for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
"I got them from a family member during the past 30 days": Equal to 1 if respondent selected any of QN20AF, QN20BF, QN20CF (During the past 30 days, how did you get your _____? (I got them from a family member) for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
"I took them from a store or another person during the past 30 days": Equal to 1 if respondent selected any of QN20AG, QN20BG, QN20CG (During the past 30 days, how did you get your _____? (I took them from a store or another person) for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
"I got them in some other way during the past 30 days": Equal to 1 if respondent selected any of QN20AH, QN20BH, QN20CH (During the past 30 days, how did you get your _____? (I got them in some other way) for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
"I didn't buy tobacco products during the past 30 days": Equal to 1 if respondent selected all of QN21AA, QN21BA, QN21CA ("During the past 30 days, where did you buy your ____? (I did not buy ____ during the past 30 days)" for each tobacco product) or equal to 1 if days used in the past 30 days is equal to 0 for all three tobacco products.
"I bought them from another person (a friend, family member, or someone else) during the past 30 days": Equal to 1 if respondent selected any of QN21AB, QN21BB, QN21CB ("During the past 30 days, where did you buy your ____? (I bought them from another person (a friend, family member, or someone else))" for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
"I bought them from a gas station or convenience store during the past 30 days": Equal to 1 if respondent selected any of QN21AC, QN21BC, QN21CC ("During the past 30 days, where did you buy your ____? (A gas station or convenience store)" for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
"I bought them from a grocery store during the past 30 days": Equal to 1 if respondent selected any of QN21AD, QN21BD, QN21CD ("During the past 30 days, where did you buy your ____? (A grocery store)" for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
"I bought them from a drugstore during the past 30 days": Equal to 1 if respondent selected any of QN21AE, QN21BE, QN21CE ("During the past 30 days, where did you buy your ____? (A drugstore)" for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
"I bought them from a mall or shopping center kiosk/stand during the past 30 days": Equal to 1 if respondent selected any of QN21AF, QN21BF, QN21CF ("During the past 30 days, where did you buy your ____? (A mall or shopping center kiosk/stand)" for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
"I bought them from a vending machine during the past 30 days": Equal to 1 if respondent selected any of QN21AG, QN21BG, QN21CG ("During the past 30 days, where did you buy your ____? (A vending machine)" for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
"I bought them on the Internet (such as a product website or store website like eBay or Facebook Marketplace) during the past 30 days": Equal to 1 if respondent selected any of QN21AH, QN21BH, QN21CH ("During the past 30 days, where did you buy your ____? (On the Internet (such as a product website or store website like eBay or Facebook Marketplace))" for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
"I bought them through the mail during the past 30 days": Equal to 1 if respondent selected any of QN21AI, QN21BI, QN21CI ("During the past 30 days, where did you buy your ____? (through the mail)" for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
"I bought them through a delivery service (such as DoorDash or Postmates) during the past 30 days": Equal to 1 if respondent selected any of QN21AJ, QN21BJ, QN21CJ ("During the past 30 days, where did you buy your ____? (through a delivery service (such as DoorDash or Postmates))" for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
"I bought them from a vape shop or tobacco shop during the past 30 days": Equal to 1 if respondent selected any of QN21AK, QN21BK, QN21CK ("During the past 30 days, where did you buy your ____? (a vape shop or tobacco shop)" for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
"I bought them from some other place not listed here during the past 30 days": Equal to 1 if respondent selected any of QN21AL, QN21BL, QN21CL ("During the past 30 days, where did you buy your ____? (some other place not listed here)" for each tobacco product). Equal to 0 if days used in the past 30 days is equal to 0 for all three tobacco products.
Data was downloaded from the CDC's website at the following link:
https://www.cdc.gov/tobacco/about-data/surveys/national-youth-tobacco-survey.html.
Variables were selected and defined in a similar manner to those in
Park-Lee, E., Gentzke, A. S., Ren, C., Cooper, M., Sawdey, M. D., Hu, S. S., & Cullen, K. A. (2023). Impact of Survey Setting on Current Tobacco Product Use: National Youth Tobacco Survey, 2021. Journal of Adolescent Health, 72(3), 365-374.
https://pubmed.ncbi.nlm.nih.gov/36470692/
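Several of the derived variables above follow simple rules that can be sketched in a few lines. All function and argument names are hypothetical. Two assumptions are labeled in the comments: item selections are assumed to be coded 1 when selected, and respondents who used a product but did not select an option are left as NA because the descriptions do not say how they are coded.

```r
# Days of use: the survey answer, or 0 for respondents who never used the product.
days_used <- function(days_answer, ever_used) {
  ifelse(ever_used %in% "No", 0, days_answer)
}

# Access indicator across the three products (e-cigarettes, cigarettes, cigars).
# Assumption: a selected item is coded 1; users without a selection become NA.
access_indicator <- function(sel_ecig, sel_cig, sel_cigar,
                             days_ecig, days_cig, days_cigar) {
  any_selected <- sel_ecig %in% 1 | sel_cig %in% 1 | sel_cigar %in% 1
  no_use <- days_ecig %in% 0 & days_cig %in% 0 & days_cigar %in% 0
  ifelse(any_selected, 1L, ifelse(no_use, 0L, NA_integer_))
}

# PHQ-4 total (0-12) to distress category: none 0-2, mild 3-5,
# moderate 6-8, severe 9-12.
phq4_category <- function(total) {
  cut(total, breaks = c(-Inf, 2, 5, 8, 12),
      labels = c("none", "mild", "moderate", "severe"))
}

# Family Affluence Scale total (0-9): low 0-5, medium 6-7, high 8-9.
fas_category <- function(total) {
  cut(total, breaks = c(-Inf, 5, 7, 9),
      labels = c("low", "medium", "high"))
}

# Illustrative PHQ-4 item responses for two respondents (not real data).
items <- matrix(c(0, 0, 1, 1,
                  3, 3, 3, 3), nrow = 2, byrow = TRUE)
distress <- phq4_category(rowSums(items))
```

`cut()` uses right-closed intervals by default, so a total of exactly 2 falls in "none" and a total of 3 in "mild", matching the ranges quoted above.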
Information from patient-reported pain assessments using the Collaborative Health Outcomes Information Registry (CHOIR) at baseline and at a 3-month follow-up.
pain
A data frame with 21,659 rows and 92 variables. Data and variable descriptions were downloaded from the "S1 Dataset".
Deidentified study identification number
Body region indicators, one variable per coded body region (74 variables in total): Body Region Selected = 1; not selected = 0
Pain intensity NRS (0-10)
PROMIS physical function T-score, range 0-100
PROMIS pain behavior T-score, range 0-100
PROMIS depression T-score, range 0-100
PROMIS anxiety T-score, range 0-100
PROMIS sleep disturbance T-score, range 0-100
PROMIS pain interference, range 0-100
PROMIS global mental health, range 0-100
PROMIS global physical health, range 0-100
Age at baseline assessment extracted from EMR
Body Mass Index at baseline extracted from EMR
Charlson Comorbidity Index extracted from EMR
Pain intensity NRS at follow up (range 0 - 10)
Patient reported gender, "male" or "female", derived from EMR
Patient reported race, 17 categories, EMR derived
Binary Charlson Comorbidity Index: "No comorbidity" CCI score = 0; "Any comorbidity" CCI score > 0
Medicaid payor: "yes" or "no"
A key for the coded body pain regions is given in S2 Fig of the linked paper.
Note that, as described in the paper, PROMIS is short for Patient-Reported Outcomes Measurement Information System: the source of the validated instruments for pain assessment used in the adaptive computerized test given to patients in accordance with the Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT). EMR refers to the electronic medical record in the University of Pittsburgh's Patient Outcomes Repository for Treatment registry (PORT).
Alter, B. J., Anderson, N. P., Gillman, A. G., Yin, Q., Jeong, J. H., & Wasan, A. D. (2021). Hierarchical clustering by patient-reported pain distribution alone identifies distinct chronic pain subgroups differing by pain intensity, quality, and clinical outcomes. PloS one, 16(8), e0254862.
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0254862
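As a small illustration of how the 0/1 body region indicators and the binary comorbidity variable can be worked with, here is a sketch on made-up rows. The column names are hypothetical, and the region count is a derived summary for illustration, not a column of the dataset.

```r
# Hypothetical 0/1 body region indicators for two patients (not real data).
regions <- data.frame(
  region_a = c(1, 0),
  region_b = c(1, 0),
  region_c = c(0, 0)
)

# Number of body regions each patient selected.
n_regions <- rowSums(regions)

# Binary Charlson Comorbidity Index as described above:
# "No comorbidity" when the CCI score is 0, "Any comorbidity" when it is > 0.
cci <- c(0, 3)
cci_binary <- ifelse(cci > 0, "Any comorbidity", "No comorbidity")
```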
Demographic and health data collected from primary care clinic patients presenting with TB symptoms in rural South Africa (Kharitode study) and urban Uganda (STOMP study).
tb_diagnosis
A data frame with 1762 rows and 11 variables:
TB test result (1 = positive, 0 = negative)
Age group
Answer to the question "What is your HIV status?" (1 = positive, 0 = negative)
Self-reported history of diabetes (1 = diabetes, 0 = no diabetes)
Answer to the question "Do you smoke tobacco?" (1 = "yes" or "not currently, but formerly", 0 = "no, never")
Answer to the question "Have you ever been diagnosed with TB in the past?" (1 = yes, 0 = no)
Sex (1 = male, 0 = female)
Answer to the question "What is the highest grade of education that you have attained?" (1 = Grade 12 or lower, 0 = Any postgraduate education or higher)
Answer to the question "How long had you had a TB symptom (cough, fever, night sweats, weight loss) before you came to clinic?" (1 = >2 weeks, 0 = <2 weeks)
Number of TB symptoms (cough, fever, night sweats, weight loss)
Country in which data were collected (South Africa = Kharitode study, Uganda = STOMP study)
Baik, Y., Rickman, H. M., Hanrahan, C. F., Mmolawa, L., Kitonsa, P. J., Sewelana, T., Nalutaaya, A., Kendall, E. A., Lebina, L., Martinson, N., Katamba, A., & Dowdy, D. W. (2020). A clinical score for identifying active tuberculosis while awaiting microbiological results: Development and validation of a multivariable prediction model in sub-Saharan Africa. PLoS medicine, 17(11), e1003420. doi:10.1371/journal.pmed.1003420
The data are held in the Johns Hopkins University Data Services database and available at doi:10.7281/T1/W2AG3A.
Demographic and health data collected from primary care clinic patients presenting with TB symptoms in rural South Africa.
tb_diagnosis_raw
A data frame with 1634 rows and 34 variables:
Did the individual consent to participate in the study? (1 = Yes)
Is the participant a patient recently tested for TB? (1 = Yes)
Has informed consent been provided by the participant if age 18 or older? (1 = Yes; 2 = No; 77 = Under 18)
Has parental consent and adolescent/child assent been provided if the participant is less than 18 years of age? (1 = Yes, 2 = No)
Complete? (2 = Yes)
Is the participant TB-negative or TB-positive? (1 = Positive; 2 = Negative)
Age group
Sex (1 = Male; 2 = Female)
What is your HIV status? (1 = Positive, 2 = Negative, 3 = Unknown, 99 = Refused)
Do you have any other medical conditions? – None
Do you have any other medical conditions? – Diabetes
Do you have any other medical conditions? – Refused
Do you have any other medical conditions? – Missing
On the day of your clinic visit, which of the following symptoms did you have? – Cough
On the day of your clinic visit, which of the following symptoms did you have? – Fever
On the day of your clinic visit, which of the following symptoms did you have? – Weight loss
On the day of your clinic visit, which of the following symptoms did you have? – Drenching sweats at night
On the day of your clinic visit, which of the following symptoms did you have? – Pain in my chest
On the day of your clinic visit, which of the following symptoms did you have? – No symptoms
On the day of your clinic visit, which of the following symptoms did you have? – Missing
How long had you had that symptom before you came to clinic that day? Enter unit of response. (1 = Days, 2 = Weeks, 3 = Months, 4 = Years, 77 = Unknown)
How long had you had that symptom before you came to clinic that day? Length of time in months.
How long had you had that symptom before you came to clinic that day? Length of time in years.
How long had you had that symptom before you came to clinic that day? Length of time in days.
How long had you had that symptom before you came to clinic that day? Length of time in weeks.
What is the highest grade of education that you have attained? (0 = None; 1 = Grade 1; 2 = Grade 2; 3 = Grade 3; 4 = Grade 4; 5 = Grade 5; 6 = Grade 6; 7 = Grade 7; 8 = Grade 8; 9 = Grade 9; 10 = Grade 10; 11 = Grade 11; 12 = Grade 12; 13 = Any postgraduate education; 14 = Attained postgraduate degree; 99 = Missing)
Are you currently taking antiretrovirals for your HIV? (1 = Yes; 2 = No; 77 = Don't know or not applicable)
Do you smoke tobacco? (1 = Yes; 2 = Not currently, but formerly; 3 = No, never; 88 = Refused; 99 = Missing)
Have you ever been diagnosed with TB in the past? (1 = Yes, 2 = No, 77 = Don't know)
If you were to develop a mild cough, how long would it likely be before you saw a doctor or other healthcare professional for a diagnosis? Unit of response. (1 = Days; 2 = Weeks; 3 = Months; 4 = Years; 77 = Don't know)
If you were to develop a mild cough, how long would it likely be before you saw a doctor or other healthcare professional for a diagnosis? Length of time in days.
If you were to develop a mild cough, how long would it likely be before you saw a doctor or other healthcare professional for a diagnosis? Length of time in weeks
If you were to develop a mild cough, how long would it likely be before you saw a doctor or other healthcare professional for a diagnosis? Length of time in months.
If you were to develop a mild cough, how long would it likely be before you saw a doctor or other healthcare professional for a diagnosis? Length of time in years.
Baik, Y., Rickman, H. M., Hanrahan, C. F., Mmolawa, L., Kitonsa, P. J., Sewelana, T., Nalutaaya, A., Kendall, E. A., Lebina, L., Martinson, N., Katamba, A., & Dowdy, D. W. (2020). A clinical score for identifying active tuberculosis while awaiting microbiological results: Development and validation of a multivariable prediction model in sub-Saharan Africa. PLoS medicine, 17(11), e1003420. doi:10.1371/journal.pmed.1003420
The data are held in the Johns Hopkins University Data Services database and available at doi:10.7281/T1/W2AG3A.
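The raw codes listed above map onto the 1/0 coding used in tb_diagnosis. A minimal sketch of that kind of recoding follows; the helper name is hypothetical, and treating "unknown", "refused", and "missing" codes as NA is an assumption of this sketch rather than a documented step.

```r
# Recode a raw survey variable to 1/0, with any other code becoming NA.
recode_binary <- function(x, one, zero) {
  ifelse(x %in% one, 1, ifelse(x %in% zero, 0, NA))
}

# HIV status: 1 = Positive, 2 = Negative, 3 = Unknown, 99 = Refused.
hiv <- recode_binary(c(1, 2, 3, 99), one = 1, zero = 2)

# Smoking: 1 = Yes, 2 = Not currently but formerly, 3 = No never, 88 = Refused.
smoke <- recode_binary(c(1, 2, 3, 88), one = c(1, 2), zero = 3)

# Education: Grade 12 or lower (0-12) vs any postgraduate education (13-14).
educ <- recode_binary(c(0, 12, 13, 14, 99), one = 0:12, zero = 13:14)
```

Keeping the mapping in one helper makes each variable's coding explicit at the call site, which is convenient when the raw codes differ from question to question.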
Texas abortion counts and rates by race/ethnicity and county of residence from 2016 to 2021 from the Texas Department of State Health Services (up to June 2018) and the Health and Human Services Commission since then.
tex_itop
A data frame with 1,524 rows and 18 variables:
County of residence in Texas
Total number of abortions
Total number of abortions among Asian women between the ages of 15 and 44
Total number of abortions among Hispanic women between the ages of 15 and 44
Total number of abortions among White women between the ages of 15 and 44
Total number of abortions among Black women between the ages of 15 and 44
Total number of abortions among Native American women between the ages of 15 and 44
Total number of abortions among women of other races or ethnicities between the ages of 15 and 44
Year
Indicator for whether the county is 'rural' or 'urban' according to the Texas Department of Housing and Community Affairs
Abortion rate per 1000 women between the ages of 15 and 44
Abortion rate per 1000 Asian women between the ages of 15 and 44
Abortion rate per 1000 Hispanic women between the ages of 15 and 44
Abortion rate per 1000 White women between the ages of 15 and 44
Abortion rate per 1000 Black women between the ages of 15 and 44
Abortion rate per 1000 Native American women between the ages of 15 and 44
Abortion rate per 1000 women of other races or ethnicities between the ages of 15 and 44
Indicator for whether the county is urban, suburban, or rural according to the RUCC (rural-urban continuum codes) from the U.S. Department of Agriculture in 2013. Counties with Rural-Urban Continuum codes of 1-3 were categorized as urban, counties with codes of 4-7 were categorized as suburban, and counties with codes of 8 or 9 were categorized as rural.
Note from the data website: for the year 2020, "Data do not include 82 reports submitted after statutory deadlines and that were not available when annual data were compiled."
Abortion counts by county and race/ethnicity were obtained from Texas Health and Human Services ITOP Statistics at the following link:
https://www.hhs.texas.gov/about/records-statistics/data-statistics/itop-statistics
To calculate abortion rates, total female populations between the ages of 15 and 44 were retrieved using the tidycensus package in R:
https://CRAN.R-project.org/package=tidycensus
Census codes for females between the ages of 15 and 44 by each race/ethnicity were retrieved from the following website:
https://api.census.gov/data/2020/dec/dhc/variables.html.
Information on whether counties are categorized as rural or urban was obtained from the 2022 Index of Texas Counties from the Texas Department of Housing and Community Affairs.
The 2013 Rural-Urban Continuum Codes from the U.S. Department of Agriculture were obtained from the following site:
https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/
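The rate calculation and the RUCC grouping described above can be sketched as follows. The function names and input values are hypothetical; real population denominators would come from tidycensus, which this sketch does not call.

```r
# Rate per 1000 women aged 15-44 from a count and a population denominator.
rate_per_1000 <- function(abortions, women_15_44) {
  1000 * abortions / women_15_44
}

# Group Rural-Urban Continuum Codes: 1-3 urban, 4-7 suburban, 8-9 rural.
rucc_category <- function(rucc) {
  cut(rucc, breaks = c(0, 3, 7, 9),
      labels = c("urban", "suburban", "rural"))
}

# Illustrative values only (not real county data).
rate   <- rate_per_1000(abortions = 50, women_15_44 = 10000)
groups <- rucc_category(c(1, 4, 8, 9))
```

The same `rate_per_1000()` call would apply to each race/ethnicity column, paired with the matching population count.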