Censoring (statistics)

In statistics, censoring is a condition in which the value of a measurement or observation is only partially known.

For example, suppose a study is conducted to measure the impact of a drug on mortality rate. In such a study, it may be known that an individual's age at death is at least 75 years (but may be more).

The problem of censored data, in which the observed value of some variable is partially known, is related to the problem of missing data, where the observed value of some variable is unknown. Censoring should not be confused with the related idea of truncation.

Interval censoring can occur when observing a value requires follow-ups or inspections. Estimation methods for left-censored data vary, and not every method is applicable to, or the most reliable for, every data set.[1]

A common misconception with time interval data is to class intervals whose start time is unknown as left censored. In these cases we have a lower bound on the time interval, so the data are right censored (even though the missing start point lies to the left of the known interval when viewed on a timeline).

Special techniques may be used to handle censored data. Tests with specific failure times are coded as actual failures; censored data are coded for the type of censoring and the known interval or limit. Special software programs (often reliability oriented) can conduct a maximum likelihood estimation for summary statistics, confidence intervals, etc.

One of the earliest attempts to analyse a statistical problem involving censored data was Daniel Bernoulli's 1766 analysis of smallpox morbidity and mortality data to demonstrate the efficacy of vaccination.[2] An early paper to use the Kaplan–Meier estimator for estimating censored costs was Quesenberry et al. (1989).[3] However, Lin et al.[4] found this approach to be invalid unless all patients accumulated costs with a common deterministic rate function over time, and they proposed an alternative estimation technique known as the Lin estimator.

To incorporate censored data points in the likelihood, each censored data point is represented by its probability as a function of the model parameters, given a model, i.e. a function of the CDF(s) instead of the density or probability mass.

For right-censored failure times that follow an exponential distribution with rate $\lambda$, let $u_i$ be the observed failure or censoring time of unit $i$ and let $k$ be the number of uncensored observations (actual failures). Setting the derivative of the log-likelihood with respect to $\lambda$ to zero and solving for $\lambda$, we get:

$$\hat{\lambda} = \frac{k}{\sum_i u_i}$$

Equivalently, the mean time to failure is:

$$\frac{1}{\hat{\lambda}} = \frac{\sum_i u_i}{k}$$

This differs from the standard MLE for the exponential distribution in that any censored observations are considered only in the numerator.
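The likelihood construction described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes an exponential model, and the record layout `(kind, lower, upper)` and the function names are hypothetical choices made for this example. An exact observation contributes the density, while a censored observation contributes a probability computed from the CDF.

```python
import math


def exp_cdf(t, rate):
    """CDF of the exponential distribution: P(T <= t)."""
    return -math.expm1(-rate * t)


def exp_pdf(t, rate):
    """Density of the exponential distribution at t."""
    return rate * math.exp(-rate * t)


def log_likelihood(observations, rate):
    """Log-likelihood of `rate` given records of the form (kind, lower, upper).

    kind is one of (hypothetical coding, chosen for this sketch):
      "exact"    -- value observed exactly at `lower`
      "right"    -- only known that the value exceeds `lower`
      "left"     -- only known that the value is below `upper`
      "interval" -- only known that the value lies in (lower, upper]
    """
    total = 0.0
    for kind, lower, upper in observations:
        if kind == "exact":
            contribution = exp_pdf(lower, rate)
        elif kind == "right":
            contribution = 1.0 - exp_cdf(lower, rate)
        elif kind == "left":
            contribution = exp_cdf(upper, rate)
        elif kind == "interval":
            contribution = exp_cdf(upper, rate) - exp_cdf(lower, rate)
        else:
            raise ValueError(f"unknown censoring kind: {kind!r}")
        total += math.log(contribution)
    return total
```

With only exact and right-censored records, this log-likelihood reduces to the number of failures times log(rate), minus rate times the total observed time. Other distributions could be used by swapping in their density and CDF.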
Example of five replicate tests resulting in four failures and one suspended time resulting in censoring.
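A data set shaped like the one in the caption (four failures and one suspended test) might be coded and analysed as follows. The hours below are invented purely for illustration and are not taken from the figure; the `Replicate` record and the function name are likewise hypothetical. The sketch assumes exponentially distributed failure times and right censoring only, in which case the maximum likelihood estimate of the mean time to failure is the total time on test divided by the number of failures.

```python
from dataclasses import dataclass


@dataclass
class Replicate:
    hours: float   # time on test when the unit failed or was suspended
    failed: bool   # True = actual failure, False = suspended (right censored)


# Hypothetical data: four failures and one suspension at 200 hours.
replicates = [
    Replicate(95.0, True),
    Replicate(120.0, True),
    Replicate(140.0, True),
    Replicate(180.0, True),
    Replicate(200.0, False),
]


def exponential_mttf(records):
    """MLE of mean time to failure under an exponential model.

    Every record adds its time to the numerator, but only actual
    failures are counted in the denominator.
    """
    failures = sum(r.failed for r in records)
    if failures == 0:
        raise ValueError("at least one failure is needed for a finite estimate")
    total_time = sum(r.hours for r in records)
    return total_time / failures


mttf = exponential_mttf(replicates)

# For comparison: averaging the failure times alone ignores the suspended
# unit, which is known to have survived at least 200 hours.
failure_hours = [r.hours for r in replicates if r.failed]
naive_mean = sum(failure_hours) / len(failure_hours)

print(mttf)        # 183.75
print(naive_mean)  # 133.75
```

In this made-up example, dropping the suspended test lowers the estimate from 183.75 to 133.75 hours, because the information that one unit survived at least 200 hours is discarded.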