Conference of Users of Official Statistics 11-12 November, Wellington, 1979?
I suspect I was invited to give this paper as one of the research workers in New Zealand most dependent upon official statistics. From one point of view then, I am experienced over a very wide range of official statistics and the problems which confront research users. From another point I am unlikely to bite the hands that feed me. Indeed how can I be less than generous to the providers of official statistics who over the years have taken much time and trouble to assist me?
They will report that I lead a fairly chaotic research life, and I move from one research topic to the next disparate one, often in a matter of hours. But for today I shall try to provide some order to my experiences by examining the following topics: documentation, access, aggregation and confidentiality, consistency, timing and development.
The choice of documentation as the first topic rests on the principle that effective use of statistics requires an understanding of their definition and construction.
For instance last week I came across the problem of what do we mean by “child”. There are two separate ideas in the concept – to quote the Pocket Oxford a “young human being” or “a son or daughter”. Thus in some circumstances the child may be an adult. The problem occurred in a measurement of solo parent families using the 1971 Census, which usually uses child to mean an (unmarried) son or daughter of any age. That has the unfortunate consequence that some of the solo family groups recorded in the Census are elderly widows and their adult unmarried children. I hardly need to remark that that is not the solo parent family concept that the user had in mind.
An experienced researcher soon learns to watch for these definitional traps, but most users are not so experienced and data documentation has to be presented with their needs in mind. The experienced researcher is more troubled where there is data construction.
For instance I have raised elsewhere the problem of how we measure the imputed rental on owner occupied dwellings. If our method is too conservative, we underestimate our Gross Domestic Product, our growth rate and the efficiency of our investment. I am not arguing that the official measurement procedure is wrong. Rather for other purposes we may wish to use a different method. In such circumstances a description of the official measurement method becomes crucial.
Such issues are current problems. They become more acute with time, and economic historian Brent Layton asked me to mention the importance of keeping good archives so that such issues can also be investigated in the future. I am a great believer in the importance of the historical perspective – if we had more we would be in less of a mess today. I understand that there is no official statistical archivist. May I suggest that despite the sinking lid on public service appointments, the time may be particularly propitious to establish such an archivist. After all it may be some time before once more the Minister of Statistics is also a history graduate.
Traditionally the main access the research worker had to official statistics was through official publications. Perhaps the most important is the “New Zealand Official Year Book” even though its data may be up to eighteen months behind the latest figures available. I use the Year Book extensively, but despite its strengths it has a major deficiency for a statistician in that it gives little guidance where else to go. May I plead that all its tables should include the source from which they are derived and each chapter should conclude with a section referring the reader to more comprehensive data sources.
However, in the future we may expect greater use of direct access to the data files – a facility of particular value in terms of time, coverage, and ease of use when data from the main file can be read off into a working file of the user. I know of CISS, the Central Index of Statistical Series, and I understand that there are discussions on making the Index available to non-official users, perhaps through on-line access. I am sure all research workers will applaud such a development.
I am not familiar enough with the details of CISS to comment further. It now forms the basis of the “Monthly Abstract of Statistics” which, once sufficient on-line terminals become available, may gain a new role since access to the latest monthly data will be quicker and more comprehensive through CISS. Many monthly tables will in future need publication only once a year. Perhaps the new role of the MAS will include publication of documentation and research developments and of the semi-official statistics I shall be referring to later.
CISS is a much bigger data bank than is available in current publications. Public access to it will make redundant my point that research workers often need more detailed data than is published. I have been fortunate to have had the co-operation of the Department of Statistics in providing me with their detailed figures for some purposes. I hope that this privilege can be extended to more researchers, with less hassles for the Department.
One area which I doubt that CISS will contribute to is access to cross tabulations, but I shall deal with that in the next section.
I complete this section by reference to the access problems of research workers outside Wellington. We are fortunate in Christchurch to have a large office of the Department of Statistics. Research workers in other centres who are not so fortunate (or in the case of the statistics collected by other agencies) can have major frustrations in trying to get appropriate data or explanations. Perhaps one should remind Wellingtonians that most research is done in other centres. The role of Wellington seems to be to generate the research problems.
Aggregation and Confidentiality
Every research worker requires his or her own level of aggregation. Ideally the data should be available at the most disaggregated level in a computer file, with the researcher carrying out the aggregation when calling up the data. Practically the effort of constructing such finely disaggregated data can be resource consuming and problematic. Moreover disaggregation generates problems of confidentiality. Energy Engineer, John Peet, remarked to me recently that he had actually lost information in the process of disaggregation because so much of his crucial production data had those frustrating asterisks which footnote to “suppressed to avoid disclosure of confidential information”. The problem must become acute for regional scientists.
There cannot be any general approach to this problem. I must confess I favour much greater disclosure of information by companies, particularly monopolies, but that is a question of commercial policy.
In the meantime I wonder whether in some cases the Statistics Department might approach the monopoly and ask whether they would permit publication of the relevant data. Such an instance would be Petrocorp, sole supplier of Natural Gas.
Another approach is to construct semiofficial estimates of the obscured data. This is what researchers like John Peet do anyway, using published data to construct plausible estimates of the unknown statistics. Such semiofficial statistics could be calculated by the statisticians doing the official statistics, by an outside contractor (I will give almost an illustration of this later), or by some research worker in the course of his work. They would be published with the official data, but with an indication to show they were derived a different way and less reliable.
A third approach is to give research workers a sample of the data with individual identification removed and allow them to work on it. For this to be successful the population has to be large and the sampling proportion small. Perhaps it is only appropriate for the Population Census, and currently the Cents Aid facility makes it possible for the research worker to examine large numbers of households without in any way infringing the confidentiality of the Census Records. I hope we see Cents Aid as a start. It has only a limited number of inferential statistics options – I feel quite naked being unable to carry out multivariate analyses, such as regression. Another problem which I have hit is that the census household does not conform to the unit for taxation purposes, and it looks as though I shall need a more direct access to the sample files than the programs currently permit. We have also had processing problems. The logistics for those of us out of Wellington make it a very slow procedure, while being charged commercial computer rates loses the University social scientists one of the few financial advantages we have. I hope serious consideration will be given to making sample files of the Census records, suitably masked to protect confidentiality, available to approved researchers.
But none of these approaches is going to solve a problem I met a fortnight or so ago, which involved a tabulation from the annual farm returns which is not published. I was in a hurry so I did a patch job from existing data, but if the demand had not been urgent, might I have been able to approach the Department of Statistics for the required tabulation? I suspect they would have been co-operative but let us go a step further. In those annual farm returns we have a marvelous data bank with which multivariate statistical techniques could no doubt produce some fascinating results. In this case a one percent or even ten percent sample of the 70,000 odd farms would not be particularly useful given the enormous variation within the population. We would have to use all the returns. Is there a way we can do this?
There is a continuing tension between confidentiality and aggregation. I respect the need for confidentiality, particularly where individuals are concerned. Moreover confidentiality is often a prerequisite for the co-operation required for providing the data. But we must also be careful that we do not waste all that co-operation by failing to use the data effectively.
I acknowledge that little in this section may be applicable to regional statistics.
Economic and social change plus that phenomenon which scientists call progress means that over time data definitions have to be changed and old statistical series replaced by new ones. Sometimes the enthusiasm for innovation can destroy the usefulness of large quantities of data, as when the researcher cannot get a comparison between the new and old series. The most notorious example is surely the changes in occupational classification in the 1961 and the 1971 Censuses, which devalued the usefulness of past censuses and inhibited any investigation of the changing occupational structure in post war New Zealand.
Ideally research workers would like at least one year's data in the old and new format. A good example of this is that the new SNA national accounts commence from 1971/72 while those based on the old conventions finish in 1975/76, giving us five common years of the two accounts. (I understand that the Department of Statistics is also assisting the production of a semiofficial set of SNA national accounts back a further ten years, again an approach to be applauded).
More generally, might I suggest that before a change is made consideration be given to the consequences upon past data? If the change is really that valuable it will be worth our while to put the extra effort into providing the means to lap the old and new data on an official or semi-official basis.
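One way such lapping of old and new data might be done, where the two series share some common years, is sketched below. This is only an illustration under stated assumptions: the function name `splice_series`, the choice of the mean ratio over the common years as the linking factor, and the figures themselves are all hypothetical, and none of this describes any official procedure.

```python
def splice_series(old, new):
    """Rescale the old series onto the new basis.

    Assumption: the two series differ by a roughly constant factor,
    estimated here as the mean of new/old over the years they share.
    """
    common = sorted(set(old) & set(new))
    if not common:
        raise ValueError("the series share no common years")
    ratio = sum(new[year] / old[year] for year in common) / len(common)
    # Keep the new series wherever it exists; rescale the old one elsewhere.
    linked = {year: value * ratio for year, value in old.items() if year not in new}
    linked.update(new)
    return dict(sorted(linked.items())), ratio


# Hypothetical figures, invented purely for illustration.
old = {1970: 100.0, 1971: 110.0, 1972: 120.0}
new = {1971: 220.0, 1972: 240.0, 1973: 260.0}
linked, ratio = splice_series(old, new)
```

The more common years there are, the better the linking factor can be checked for stability, which is the practical case for running the old and new series side by side for several years.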
A related problem on the introduction of a new series is that it takes time to gain familiarity with it. Special papers, presentations at conferences, and seminars are all helpful in that respect.
But even so we may still get a problem, which I illustrate with a case I have not investigated in any detail. The general price index, the replacement for the wholesale price index, is a useful step forward in a number of ways. It was commenced in the December Quarter 1977 so we have less than three years of the series. That means, as so often happens with the introduction of new series, that the traditional techniques of multivariate regression for relating the series to, say, the consumer price index are not yet possible. What is that relationship? How long do we have to wait before we can use the GPI as a lead indicator of the CPI? Am I right in my perception that the traditional relationship between the Wholesale Price Index and the CPI has changed? In the interim only those with direct access to the data base which created the indices – that is the officials in the Department of Statistics – can offer any real indication of the answers to such questions.
As much as I favour innovation, experience with the problems of switching from old to new bases means I must advocate caution.
There is inevitably a delay between an event and the publication of its statistical record. However, sometimes the delay becomes excessive. Our last labour force figures are for 1977. We are told the estimates are being revised. That may well be so, but there is considerable concern about the state of the current labour market, and there can be little doubt that it has changed dramatically since the end of 1977.
It is possible to use existing data to construct a fair approximation to the true labour force figure. Here is a good place for the use of a semi-official statistic.
An illustration of this approach applies to the National Accounts which inevitably come out with a long delay. In the interim the New Zealand Institute of Economic Research provides estimates – in effect semi-official statistics.
I suspect that there is a need for a broad discussion about the timeliness of existing statistics, particularly among macro-economists. Sometimes we will be able to find adequate proxy statistics. Sometimes there will be administrative reforms, perhaps in conjunction with sampling procedures and/or extra resources. And sometimes we will have to rely upon semiofficial statistics.
In addition I suggest that in each of the current reviews of official statistics areas there be a specific section concerned with the timeliness of the publication of the data.
One is tempted to list under the section head developments, a series of urgently required statistical series. But my list could claim no precedence ahead of many others, and instead I want to focus on the problems of identifying and developing new statistical needs.
One problem is undoubtedly the resource cost of development coupled with the statistical obtuseness of politicians. It is easy for a politician to argue that we have functioned perfectly well for many years without a particular data base, and for him to ignore the difficulties, even blunders, that the lack of data has caused. Fifteen years ago statisticians were pleading for an annual household survey. But it was not until 1973/74 that the first survey was commissioned. Full development of the survey is not complete, but already it has had a major influence on the construction of the price index, income maintenance strategies and energy policy. Sometimes I think that when the politicians turn down a proposal we should say that they were making the same arguments against the household expenditure survey fifteen years ago, and if they insist that the arguments remain valid we will withdraw all access to the results based on the Household Expenditure Survey.
But there is a second problem tied up with the whole nature of research which compounds the logistics of the development of statistics. Significant research in any area involves a few pioneers on the frontier. By the time the conventional wisdom has recognized its significance, supported the development of the statistical basis, and that base is developed, many years will have passed.
If I may illustrate this with a small example of my own work. Whilst comparing the 1971 Census with the 1970/71 income tax returns I was struck by a large group of people who do not appear as working according to the Census, but worked according to the tax returns. The explanation is that they work during only part of the year but not in Census week. Overseas they are described as the “peripheral workforce” and they prove to be not only an important component of the adjustment mechanism for a labour market, but one of the interfaces between the economy and society. The conventional wisdom will be telling you this in a few years’ time. Our statistical knowledge of the peripheral workforce is rudimentary. It requires specific surveys of the sort done overseas. We are unlikely to have such a survey before the mid-1980s. Allowing for processing time and accumulation of a number of surveys it will be fifteen odd years between the data anomaly and its scientific resolution.
I am phlegmatic about such delays but they could be shortened by giving research workers a greater role in the data collection agencies. In advocating the appointing of more research workers in, say, the Department of Statistics and the extension of its activities into research, I do not want to undervalue the quality of the present staff nor their research work. I have had too many helpful discussions with departmental officers and attended too many memorable Departmental papers at economics, statistics, and demographic conferences ever to do that.
Rather I want to argue that an effective data collection agency needs also to have a strong involvement in the research uses of the data. There is often this involvement there, but it seems to me that it is undervalued, and should be promoted. As a University teacher I want my best graduates to be coming to me at the end of the year about careers advice and volunteering “What about the Statistics Department?” without my prompting. And I should like to be able to reply with more confidence about a career structure for social statisticians and quantitative social scientists. Some of the students’ textbooks and articles are written by employees of foreign government statistics departments. That is something we should aspire to here.
The advantage to research workers of such a development is two fold. It means we have kindred spirits more closely involved with data development and aware of our needs. The same people can also extend the informal consultancy services which the agencies provide.
In conclusion I must confess that the overall thrust of my remarks is that we should be spending more on the official data collection agencies. This comes from my belief that there is only one thing more expensive than research and that is no research, a point I would illustrate with some of the botch ups we have around us today.
Yet there is a sense in which my suggestions are cheaper than they may seem. Official statistics provide research workers with an enormous data bank which is being underutilized. That data bank does not answer all our questions, but very often it will be more fruitful and less expensive to pursue the questions the data bank raises than to go to the trouble of creating new data sources. One-off small scale surveys are easy for research workers to do, but expensive, particularly when the results do not tie into the population in any meaningful way.
This country spends over $12m a year on its collection and preparation of official statistics. We should ensure the results are used as effectively as possible.