Here is a nice 1700-word Q-and-A on technical points of statistical theory by Cosma Shalizi:

Q: What is a statistical parameter?

A: The fundamental objects in statistical modeling are probability distributions, or random processes. A parameter is a (measurable) function of a probability distribution; if you want to be old-fashioned, a “functional” of the distribution. For instance, the magnitudes of various causal influences (“effects”) are parameters of causal models. Think of these distributions as being like geometrical figures, and the parameters as various aspects of the figures: their volume, or area in some cross-section, or a certain linear dimension.

Q: So I’m guessing that whether a parameter is “identifiable” has something to do with whether it actually makes a difference to the distribution?

A: Yes, specifically whether it makes a difference to the observable part of the distribution.

Q: How can a probability distribution have observable and unobservable parts?

A: We specify models involving the variables we think are physically (biologically, psychologically, socially…) important. We don’t get to measure all of these. Fixing what we can observe, each underlying distribution induces a distribution on the variables we do measure, the observables. In the analogy, we might only get to see the shadows cast by the geometric figures, or see what volume they displace when submerged in water.

Q: And how does this relate to identifiability?

A: Every (measurable) functional of the observable distribution is identifiable, because, in principle, what we can observe gives us enough information to work it out, or identify it. Every parameter of the underlying distribution which is not also a parameter of the observable distribution is unidentifiable, or unidentified. In the analogy, if we know all the figures are boxes (i.e., rectangular prisms), but we only get to see their displacement, then volume is identifiable, but breadth, height and width are not. It is not a matter of not having enough data (not measuring the displacement precisely enough); even knowing a box’s volume exactly would not, by itself, tell us the height of the box.
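The box analogy can be made concrete with a minimal sketch. The three boxes below are made-up examples, and `displaced_volume` is a hypothetical helper standing in for the only measurement we are allowed to make.

```python
import math

def displaced_volume(breadth, height, width):
    """The only observable in the analogy: the volume a box displaces."""
    return breadth * height * width

# Three hypothetical boxes with different shapes but the same volume.
boxes = [(2.0, 3.0, 4.0), (1.0, 4.0, 6.0), (1.0, 1.0, 24.0)]
volumes = [displaced_volume(*box) for box in boxes]

for (b, h, w), v in zip(boxes, volumes):
    print(f"breadth={b}, height={h}, width={w} -> displacement={v}")

# Every box yields the same observation, so no measurement of displacement,
# however precise, can distinguish their heights.
same_observation = all(math.isclose(v, volumes[0]) for v in volumes)
distinct_heights = len({h for _, h, _ in boxes}) > 1
```

Since the boxes differ in height but agree on everything we can observe, height is not a function of the observable distribution, which is what being unidentified means here.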

Q: Are all identifiable parameters equally easy to estimate?

A: Not at all. For real-valued parameters, the natural quantification of identifiability is the Fisher information, i.e., the negative of the expectation value of the second derivative of the log-likelihood with respect to the parameter. (In general the expectation of the first derivative is zero.) But this seems like, precisely, a second-order issue after identifiability as such. Of course, if a parameter is unidentifiable, the derivative of the log-likelihood with respect to it is zero. But at this point we are leaving the clear path of identifiability for the thickets of estimation theory, and had better get back on track.
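As a worked illustration in the box analogy, suppose (an assumption not made in the original) that a single displacement measurement y equals the true volume V plus Gaussian noise with known variance σ². Describe each box by V together with the proportions h = H/B and w = W/B. With the usual sign convention, in which the Fisher information is the negative expected second derivative of the log-likelihood ℓ, this gives:

```latex
\ell(V, h, w) = -\frac{(y - V)^2}{2\sigma^2} + \text{const},
\qquad
I(V) = -\mathbb{E}\!\left[\frac{\partial^2 \ell}{\partial V^2}\right] = \frac{1}{\sigma^2},
\qquad
\frac{\partial \ell}{\partial h} = \frac{\partial \ell}{\partial w} = 0 .
```

Under these assumptions the volume carries positive information, which grows as the measurement noise shrinks, while the log-likelihood is completely flat in the proportions, so no amount of displacement data bears on them.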

Q: So is identifiability solely a function of what’s observable?

A: No, it depends on the combination of what we can measure and what models we’re willing to entertain. If we observe more, then we can identify more. Thus if we can measure the volume of a box and its area in horizontal cross-section, then we can identify its height (but not its breadth or width). But likewise, if we can rule out some possibilities a priori, then we can identify more. If we can only measure volume, but know the box is a cube, then we can find height (and all its other dimensions). Of course we could also identify height from volume and the assumption that the proportions are 1:4:9, like the monolith in 2001.
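The three routes to identifying height can be sketched as follows. The function names are hypothetical, and the monolith case assumes that height is the longest of the three sides, which the text does not specify.

```python
import math

def height_from_volume_and_cross_section(volume, cross_section_area):
    """More observables: volume = height * (breadth * width)."""
    return volume / cross_section_area

def height_if_cube(volume):
    """A priori restriction: all three sides are equal."""
    return volume ** (1.0 / 3.0)

def height_if_monolith(volume):
    """A priori restriction: sides are s, 4s, 9s, with height assumed to be 9s."""
    unit = (volume / 36.0) ** (1.0 / 3.0)  # volume = s * 4s * 9s = 36 s^3
    return 9.0 * unit

volume = 36.0
print(height_from_volume_and_cross_section(volume, cross_section_area=12.0))
print(height_if_cube(volume))
print(height_if_monolith(volume))
```

The same measured volume gives a different height under each restriction, and the volume itself offers no way to choose among them.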

Q: I get why expanding the observables lets you identify more parameters, but restricting the set of models to get identification seems to have “all the benefits of theft over honest toil”. Do people really report such results with a straight face?

A: Identifying parameters by restricting the models we entertain is just as secure as those restrictions. If we have good actual reasons for the restrictions, then it would be silly not to take advantage of that. On the other hand, restricting models simply to get identifiability seems quite contrary to the goals of science, since it is as important to admit what we do not yet know as to mark out what we do. At the very least, these are the sorts of hypotheses which need to be checked, and which must be checked with other or different data, since, by non-identifiability, the data in question are silent about them. (If you are going to assume all boxes are cubes, you should check that; but looking at their volumes won’t tell you whether or not they are cubes. That data is indifferent between your sensible cubical hypothesis and the idle fancies of the monolith-maniac.)

Q: Couldn’t we get around non-identifiability by Bayesian methods?

A: Expressing “soft” restrictions by a prior distribution about the unidentified parameters doesn’t actually make those parameters identified. Suppose, for instance, that you have a prior distribution over the dimensions of boxes, p(B, H, W). The three parameters B, H, W completely characterize boxes, and in this are equivalent to the three parameters of volume V = BHW and the two proportions or ratios h = H/B and w = W/B. Thus the prior p(B, H, W) is equivalent to an unconditional prior on volume multiplied by a conditional prior on the proportions, p(V) p(h, w | V). Since the likelihood is a function of V alone, Bayesian updating will change the posterior distribution over volumes, but leave the (volume-conditional) distribution over proportions alone. This reasoning applies more generally: the prior can be divided into one part which refers to the identifiable parameters, and another which refers to the purely unidentified parameters, and learning only updates the former. (If a Bayesian agent’s prior prejudices happen to link the identified parameters to the unidentified ones, its convictions about the latter will change, but strictly through those prior prejudices.) The prior over the identifiable parameters can and should be tested; that over the unidentified ones cannot. (Not with that data, anyway.)
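This factorization can be checked numerically with a small sketch. Everything specific here is an assumption made for illustration: the grid of side lengths, the lopsided prior, the Gaussian measurement noise, and the observed volume of 12. Because (V, h, w) and (B, H, W) determine each other, the conditional distribution of the box given its volume plays the role of p(h, w | V).

```python
import math
from collections import defaultdict
from itertools import product

SIDES = [1.0, 2.0, 3.0, 4.0]  # hypothetical grid of possible side lengths
SIGMA = 2.0                   # assumed known measurement noise

def normalize(weights):
    total = sum(weights.values())
    return {box: w / total for box, w in weights.items()}

def volume_marginal(dist):
    """Return P(V) by summing probability over all boxes with the same volume."""
    mass = defaultdict(float)
    for (b, h, w), p in dist.items():
        mass[b * h * w] += p
    return dict(mass)

def shape_given_volume(dist):
    """Return P(box | V): each box's probability divided by its volume's total."""
    mass = volume_marginal(dist)
    return {(b, h, w): p / mass[b * h * w] for (b, h, w), p in dist.items()}

def update(prior, observed_volume):
    """Bayes' rule with a Gaussian likelihood that depends on the box only through V."""
    return normalize({
        (b, h, w): p * math.exp(-((observed_volume - b * h * w) ** 2) / (2 * SIGMA ** 2))
        for (b, h, w), p in prior.items()
    })

# An arbitrary prior p(B, H, W) that deliberately favors tall boxes.
prior = normalize({(b, h, w): h ** 2 / (b + w) for b, h, w in product(SIDES, repeat=3)})
posterior = update(prior, observed_volume=12.0)

prior_vol, post_vol = volume_marginal(prior), volume_marginal(posterior)
max_volume_change = max(abs(prior_vol[v] - post_vol[v]) for v in prior_vol)

prior_shape, post_shape = shape_given_volume(prior), shape_given_volume(posterior)
max_shape_change = max(abs(prior_shape[box] - post_shape[box]) for box in prior)

print(f"largest change in P(V): {max_volume_change:.4f}")
print(f"largest change in P(box | V): {max_shape_change:.2e}")
```

In this sketch the distribution over volumes moves substantially after the update, while the volume-conditional distribution over shapes changes only by floating-point rounding error.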

Q: If a parameter is unidentified, why bother with it at all? Why not just use Occam’s Razor to shave them away?

A: That seems like an excess of positivism. (And I say this as someone who is sympathetic to positivism.) After all, which parameters are identifiable depends on what we can observe. It seems excessive to regard boxes as one-dimensional when we can only measure displaced volume, but then three-dimensional when we figure out how to use a ruler.

Q: Still, shouldn’t there be a presumption against the existence or importance of unidentifiable parameters?

A: Not at all. It is very common in politics to simultaneously assert that the electorate leans towards certain parties in certain years; that people born in certain years have certain inclinations; and that people’s political inclinations go through a certain sequence as they age. If we admit all three kinds of processes, we have to try to separate the effects on political opinions of people’s age, the year they were born (their cohort), and the current year (the period). The problem is that (e.g.) everyone who will be 45 years old in 2012 was born in 1967, so there is no way to separate the effects of being 45 years old in 2012 (age+period) from being born in 1967 (cohort). Any two of the effects of age, period and cohort are identifiable if we rule out the third a priori; if we allow that all three might matter, we are not able to identify their effects.
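The collinearity behind this can be sketched with a toy linear model. The linear form, the effect sizes, and the respondents are all made up for illustration; the only thing taken from the text is the identity cohort = period - age.

```python
import math

def predicted_opinion(age, period, age_effect, period_effect, cohort_effect):
    """Toy linear age-period-cohort model; cohort is fixed by cohort = period - age."""
    cohort = period - age
    return age_effect * age + period_effect * period + cohort_effect * cohort

# Hypothetical respondents as (age, survey year) pairs.
respondents = [(45, 2012), (30, 2012), (45, 2000), (60, 1996)]

# Two very different stories about what drives opinion (made-up numbers).
pure_aging = dict(age_effect=0.5, period_effect=0.0, cohort_effect=0.0)
no_aging = dict(age_effect=0.0, period_effect=0.5, cohort_effect=-0.5)

predictions_aging = [predicted_opinion(a, p, **pure_aging) for a, p in respondents]
predictions_no_aging = [predicted_opinion(a, p, **no_aging) for a, p in respondents]

for (a, p), x, y in zip(respondents, predictions_aging, predictions_no_aging):
    print(f"age={a}, year={p}: pure aging -> {x}, period+cohort -> {y}")

indistinguishable = all(
    math.isclose(x, y) for x, y in zip(predictions_aging, predictions_no_aging)
)
```

The two parameter settings predict the same opinion for every possible respondent, so data on age, period and cohort alone cannot tell them apart. Setting any one of the three effects to zero a priori removes the ambiguity in this toy model.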

Q: I fail to see how this isn’t actually an example in favor of my position — people think these are three different effects, but they’re just wrong.

A: We can break this sort of impasse by specifying more detailed mechanisms (and hoping we get more data). For instance, suppose that people tend to become more politically conservative as they age, but that this is because they accumulate more property as they grow older. Then, with data on property holdings, we could separate the effects of cohort (were you born in 1967?) and age (are you 45?) from period (are you voting in 2012?), because aging influences political opinions not through a mysterious black box but through an observable mechanism. Or again, there are presumably mechanisms which lead to period effects, as in Hibbs’s “Bread and Peace” election model. (Even if that model is wrong, it illustrates the kind of way a more elaborate theory can bring evidence to bear on otherwise-unidentifiable questions.) Of course these more elaborated, mechanistic theories need to be checked themselves, but that’s science.

Q: So, what does all this have to do with the social-contagion debate?

A: What Andrew Thomas and I showed is that the distinction between the effects of homophily and those of social influence or contagion is unidentifiable in observational (as opposed to experimental) data. This, to my way of thinking, is a much more consequential problem for claims that such-and-such a trait is socially contagious than doubts about whether this-or-the-other significance test was really appropriate; it says that the observational data was all irrelevant to begin with. Instead, trying to attribute shares of the similarity between social-network neighbors to influence vs. pre-existing similarity is just like trying to say how much of the volume of a box is due to its height as opposed to its width: it’s not really a question data could answer. It could be that we could use other evidence to show that most boxes are cubes, but that’s a separate question. No amount of empirical evidence about the degree of similarity between network neighbors can tell us anything about whether the similarity comes from homophily or influence, just as no amount of measuring the volume of boxes can tell us about their proportions.

Q: Mightn’t there be assumptions about how social influence works, or how social networks form, which let us estimate the relative strengths of social contagion and homophily?

A: There might be indeed; we hope to find them; and to find external checks on such assumptions. Discovering such cross-checks would be like finding ways of measuring the volume of a geometrical body and its horizontal cross-section. Andrew and I talk about some possibilities towards the end of our paper, and we’re working on them. So, I’m sure, are others.

Q: I find your ideas intriguing; how may I subscribe to your newsletter?

A: For more, see Partial Identification of Parametric Statistical Models; my review of Manski on identification for prediction and decision; and Manski’s book itself.

Glad you liked it; but s/Cosmas/Cosma/.

Thanks for your comment, Cosma. I liked it indeed 😉

Apologies for having misspelled your name. I guess this happened because in my native language there is an s (or rather an ς/Σ) at the end…

No worries. “Kosmas” was definitely the original form, but somehow the final consonant got dropped in the passage from Greek to Italian…