Project Proposal

Background & Significance


Introduction

By remarkable serendipity, complete machine-readable population censuses of five North Atlantic countries in the late nineteenth century will soon become available for social science research. The Church of Jesus Christ of Latter-day Saints (LDS), in collaboration with local genealogical societies, laboriously digitized three of these censuses (for Britain, Canada and the United States) to provide a resource for genealogical research. That massive project involved some 4.8 million hours of work by thousands of volunteers and professionals, and resulted in a verified transcription of the census information on the approximately 85 million individuals who resided in those countries in 1880 or 1881.

The copyright for the British data is held by Her Majesty's Stationery Office, and the data were produced with the cooperation of the Public Record Office and The General Register Office for Scotland. Consequently, the completed dataset is the property of the British Government. The History Data Service of the UK Data Archive has recently obtained authorization from both London and Edinburgh to distribute the datasets to academic researchers. In the U.S. and Canadian cases, the transcriptions are copyrighted by the LDS. In the past twelve months, the Minnesota Population Center and the Institute for Canadian Studies at the University of Ottawa have both negotiated agreements with the LDS allowing us to freely distribute the data to academic researchers in exchange for cleaning the data.

The Norwegian and Icelandic cases are somewhat different. Over the past two decades, Norwegian researchers have invested more than half a million hours in digitizing historical population records. The national censuses of 1865 and 1900 are now complete, and the census of 1875 is well underway. Although the primary use of these materials to date has been for genealogical purposes, they were envisioned from the beginning as a source for social science research. The database is a collaborative product of the Norwegian Historical Data Centre (Tromsø) and the Digital Archive of the Norwegian National Censuses (Bergen). In Iceland, the censuses of 1860, 1870 and 1901 have been transcribed as part of an effort to construct genealogies for genetic research, and work is underway on the censuses of 1880 and 1890.

The result of all these labors is a transcription of the characteristics of 90 million persons who resided on the North Atlantic rim in the late nineteenth century. The census in each case provides information on age, sex, marital status, family relationships, occupation and birthplace, and allows the construction of a full complement of variables describing household composition, fertility, and neighborhood and community characteristics. In their present form, however, these data are of minimal use for social science research. There are literally millions of occupational titles, birthplaces, family relationships and geographic localities transcribed in four different languages. Before any of these data can be fully exploited, each variable must be numerically coded and classified. Efforts are already planned or underway in each country to carry out such classification. This proposal seeks funds to coordinate our work, so that we will be able to pool the datasets and carry out cross-national analyses of the North Atlantic population.

Initial discussions about the potential for creating an integrated database for the entire populations of the United States, Britain and Canada occurred in Ottawa in April 1999, at a meeting of the International Microdata Access Group (IMAG). The Norwegian and British projects were already underway, and participants from those countries described their plans for converting their data into a form usable by social scientists. The U.S. and Canadian participants had just learned of the existence of the LDS transcriptions of data from the censuses of 1880 and 1881, and all immediately realized the potential for a powerful integrated social science database.

During the course of the next year, the Canadian and U.S. groups obtained permission from the LDS to disseminate the data and raised external funds to process the datasets. In June and October 2000 participants from each country met in Minneapolis to define the goals of the project and develop a detailed plan of work. The participants agreed that we should not simply create compatible datasets, but rather should develop a single fully integrated database with common coding systems, constructed variables, documentation and dissemination systems. We agreed that this ambitious plan for international collaboration would require additional funding.

The collaborators on this project have pieced together funding from numerous sponsors in four countries to support the painstaking tasks of data cleaning and coding. In Britain, the funders include the Economic and Social Research Council, the Leverhulme Trust, and the Essex University Research Promotion Fund; in Canada, the Social Sciences and Humanities Research Council, the Harold Crabtree Foundation, the Church of Jesus Christ of Latter-Day Saints, and the University of Ottawa Research Partnerships Programme; in Norway, the Norwegian Research Council, the Norwegian National Archives and the Faculty of Social Sciences of the University of Tromsø; and in the United States, the National Science Foundation and the National Institutes of Health. A comparatively modest infusion of support for international collaboration will leverage this investment and allow us to create an extraordinary resource for comparative social and economic research.

We envision the North Atlantic Population Project as the foundation for a long-term collaborative enterprise to reconstruct the population of this region from the mid-nineteenth century to the present. Table 1 describes the surviving individual-level censuses for each country. The five countries involved in this collaboration are fortunate to have extraordinarily rich collections of surviving individual-level census data; indeed, these are probably the five best-endowed nations of the world in this respect. Although many countries undertook censuses in the late nineteenth century, in almost all other cases the individual-level enumerators' returns were destroyed or lost. In the long run, we envisage a series of complete census transcriptions for each country. This would allow a longitudinal perspective on the North Atlantic population as it underwent industrialization, urbanization and demographic transition. Our proposed collaboration is an essential first step, and will establish the standards for future expansion of the database.

Advantages of a complete-count census database

Usable national census samples already exist for late-nineteenth-century Britain, Canada and the United States; these samples are identified with an "S" in Table 1. The proposed database, however, will be far more powerful than these existing resources. The availability of information on the entire population will open important new avenues of research in all five countries. The paragraphs that follow describe some of the new methodological approaches that will be possible with complete-count data.

Study of small, dispersed population subgroups. The minimum acceptable number of cases for census data is substantially larger than is needed for a typical rectangular sample survey. For many topics of study, the relevant individuals for analysis are a small subset of the sample population. For example, fertility analysis is ordinarily limited to the population of women 15 to 49 years old and studies of occupational structure are restricted to the employed population. The precision of census samples is further limited because such samples are invariably clustered by households.

Table 1. Availability of North Atlantic Census Microdata, 1840-2000


1840  1860  1880  1900  1920  1940  1960  1980  2000 
Britain   M   S   M   M   C   M   P   M   M   M   M   M   M   Z   Z   S   Z      
Canada               S   C   Z   S   M   M   M   M   M   M   S S S S S S Z      
Iceland C M C M M C   C   P   P   C   P   M   C   M   M   M       S              
Norway             C   P     M   C   M   M   M     M M   S   S   S   S   Z      
United States       S   S   S   C       S   S   S   Z   S   S   S   S   S   S   Z      

Key

M Manuscript individual-level census survives
S Machine-readable national sample of census exists
C Complete machine-readable transcription exists
P Complete machine-readable census transcription planned or in progress
Z Machine-readable national census sample planned

Some of the most important variables for analysis, such as region, size of locality, household structure and race, are perfectly or nearly perfectly homogeneous within clusters. When analyzing such characteristics, the number of households rather than the number of individuals determines sample precision (Ruggles 1995a). The number of cases needed to analyze a population subgroup depends on the type of subgroup, the type of analysis, population heterogeneity and desired precision. If high precision estimates are required, many thousands of cases of the subgroup of interest may be necessary.
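The effect of household clustering on precision can be illustrated with the standard design-effect approximation from survey sampling, deff = 1 + (m - 1) * rho, where m is the average household size and rho is the intraclass correlation of the variable under study. The sketch below is illustrative only: the function name and the sample figures are assumptions made for this example and are not drawn from any of the census samples.

```python
def effective_sample_size(n_individuals: int,
                          avg_household_size: float,
                          rho: float) -> float:
    """Approximate effective sample size under household clustering.

    Uses the Kish approximation deff = 1 + (m - 1) * rho, where m is the
    average household size and rho is the intraclass correlation.
    """
    deff = 1.0 + (avg_household_size - 1.0) * rho
    return n_individuals / deff


# Hypothetical sample: 100,000 individuals, households averaging 5 persons.
n, m = 100_000, 5.0

# rho = 1: every household member shares the same value (e.g. region),
# so precision is governed by the number of households.
print(effective_sample_size(n, m, 1.0))  # 20000.0

# rho = 0: no within-household similarity, so every individual counts.
print(effective_sample_size(n, m, 0.0))  # 100000.0
```

With perfect within-household homogeneity the effective sample size collapses to the number of households, which is the situation described above for variables such as region or size of locality.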

Many small population subgroups—defined by race, ethnicity, occupation or even age—can only be studied with very high-density datasets. For example, the availability of complete-count data will allow study of the indigenous populations of Canada, Norway and the United States. The new database will even include sufficient indigenous women of childbearing age to allow in-depth fertility analysis. Similarly, existing sample data are insufficient to study immigrant groups in detail. A substantial proportion of the Icelandic population emigrated to Canada and the United States, but there are insufficient Icelandic cases in any of the existing samples to allow quantitative analysis. The number of Norwegians is larger, but is still insufficient for detailed analysis. The new database will also be sufficiently large to compare specific occupations across all five nations, such as sailors and fishermen, and will for the first time allow comparative study of centenarians.

Community studies. The community study has been one of the most fruitful analytical approaches in both history and sociology. The existing sample datasets lack sufficient cases to examine particular localities. Because it includes the entire population, the new database will allow historians and sociologists to extract customized datasets focusing on particular communities. The international dimension of the database will allow investigators to undertake comparative community studies. For example, an investigator could compare patterns in a Minnesota Norwegian community with the sending community in Norway. There is a large historical demand for local statistical data, and the North Atlantic database will immediately become an essential tool for community historians of all sorts. Even historians who make little use of quantitative analysis will be able to quickly and painlessly locate their study subjects in the manuscript census.

Longitudinal analysis. Perhaps the greatest limitation of the existing samples is that they are cross-sectional snapshots and do not allow one to trace individuals across time. This problem will be greatly alleviated by the new database. In Britain, Canada and the United States, there exist machine-readable samples of the census for multiple years. It will therefore be possible to create a series of linked samples. In the case of Canada, for example, individuals in the 1871, 1891 and 1901 census samples can be linked to the complete-count 1881 census, yielding three linked Canadian samples covering 1871-1881, 1881-1891 and 1881-1901. As shown in Table 1, complete censuses for multiple census years exist or are in preparation for Norway, Iceland and Britain. These datasets offer the potential to link individuals across more than a single pair of census years. Researchers will even be able to link some individuals across countries, especially from Norway and Iceland in 1865 to the United States in 1880 and Canada in 1881.

Historians have been linking individuals across censuses for decades, but the results are problematic. In most cases, linked census studies have been based on local populations because no complete census for a larger area has been available. These studies generally lose between 60 and 80 percent of the population each decade due to linkage failures (see for example Katz 1975, Knights 1991, Thernstrom 1964, Guest 1987, Ferrie 1996). Most linkage failure is attributable to the very high migration characteristic of the mid-nineteenth century. The availability of high-quality census files including entire populations will allow far more sophisticated matching than has previously been possible. Using the new database, for example, entire countries can be searched using characteristics such as age, sex, birthplace, birthplace of mother, and birthplace of father as well as name. The new database will allow a far higher match rate than previous studies achieved and will provide linked samples thousands of times larger. Moreover, because the analyses will be based on representative populations at both ends of the record linkage, any biases in the linked population will be readily detectable.
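The kind of matching described above can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the project's linkage procedure: the Person fields, the two-year tolerance on implied birth year, and the 0.85 name-similarity threshold are all hypothetical choices, and the name comparison uses the generic string matcher in Python's standard library rather than a purpose-built phonetic algorithm. A sample record is linked only when exactly one census record survives all the tests, so ambiguous cases are left unlinked.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Person:
    name: str
    sex: str
    birth_year: int   # implied: census year minus reported age
    birthplace: str


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_links(sample, census, max_year_gap=2, min_similarity=0.85):
    """Return (sample_person, census_person) pairs with a unique match."""
    # Block the complete census on sex and birthplace so that each sample
    # record is compared only with plausible candidates.
    blocks = {}
    for person in census:
        blocks.setdefault((person.sex, person.birthplace), []).append(person)

    links = []
    for s in sample:
        candidates = [
            c for c in blocks.get((s.sex, s.birthplace), [])
            if abs(c.birth_year - s.birth_year) <= max_year_gap
            and name_similarity(c.name, s.name) >= min_similarity
        ]
        # Keep only unambiguous matches; zero or several candidates
        # count as a linkage failure.
        if len(candidates) == 1:
            links.append((s, candidates[0]))
    return links


# Invented records for illustration only.
sample_1871 = [Person("Ole Hansen", "M", 1850, "Norway")]
census_1881 = [
    Person("Ole Hanson", "M", 1851, "Norway"),
    Person("Ole Hansen", "M", 1820, "Norway"),
    Person("Anna Olsen", "F", 1850, "Norway"),
]
for earlier, later in find_links(sample_1871, census_1881):
    print(earlier.name, "->", later.name)
```

Additional fields such as birthplace of mother and birthplace of father could be added to the blocking key or to the candidate filter in the same way.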

Linked census data hold the promise of finally resolving some of the longest-running debates in nineteenth-century social history. Past studies of social and geographic mobility were ultimately inconclusive because of their exclusion of migrants and their small sample size. Scholars will be able to gauge the extent of social and geographic mobility, analyze the interrelationship of geographic and economic movement, and assess trends and differentials in social mobility far more reliably than heretofore (Thorvaldsen 1995). In addition, the linked samples will allow investigation of questions regarding family formation and dissolution. For example, they will allow us to answer several controversial questions surrounding the formation of multigenerational households in the nineteenth century (Ruggles 1994a, 2000).

Multilevel analysis. In recent years, multilevel analyses of the effects of local context on individual behavior have proven exceedingly valuable tools for research in historical sociology (see for example Elman 1998; Kramarow 1995; Ruggles 1997a, 1997b). A key problem for such nineteenth-century research, however, is that the method requires independent variables tabulated for small geographic units, and such data are scarce before the twentieth century. The new North Atlantic database will allow creation of a wide variety of contextual variables (such as racial or ethnic composition, female labor-force participation, and occupational structure) at any geographic level, including the block, the neighborhood, and the enumeration district.
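Constructing a contextual variable from complete-count records amounts to aggregating individuals within a geographic unit and attaching the result back to each person. The sketch below shows one such variable, the share of women aged 15 and over with a recorded occupation, computed by district. It is an illustration only: the record layout and field names (district, sex, age, occupation) are assumptions made for this example, and treating any non-empty occupation entry as labor-force participation is a deliberate simplification.

```python
from collections import defaultdict


def female_lfp_by_district(records):
    """Share of women aged 15+ with a recorded occupation, per district."""
    women = defaultdict(int)
    employed = defaultdict(int)
    for r in records:
        if r["sex"] == "F" and r["age"] >= 15:
            women[r["district"]] += 1
            if r["occupation"]:
                employed[r["district"]] += 1
    # Districts with no women aged 15+ are omitted rather than given a rate.
    return {d: employed[d] / women[d] for d in women}


def attach_context(records, rates):
    """Copy each record, adding its district-level rate (None if missing)."""
    return [
        {**r, "district_female_lfp": rates.get(r["district"])}
        for r in records
    ]


# Invented records for illustration only.
records = [
    {"district": "A", "sex": "F", "age": 30, "occupation": "weaver"},
    {"district": "A", "sex": "F", "age": 40, "occupation": ""},
    {"district": "A", "sex": "M", "age": 35, "occupation": "fisherman"},
    {"district": "B", "sex": "F", "age": 10, "occupation": ""},
]
rates = female_lfp_by_district(records)
print(rates)  # {'A': 0.5}
```

The same pattern applies at any level of geography for which the records carry an identifier, and the attached rate can then serve as a group-level predictor in a multilevel model.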

Geographic Information Systems. Geographers are ordinarily unable to tap the power of microdata. The existing nineteenth-century microdata files are samples, so when they are used for small areas they provide insufficient precision for reliable mapping. Although some relatively high-density samples are available for the period since 1970, those microdata files suppress detailed geographic data. Therefore, geographers are forced to rely on complete count aggregate data that usually provide only basic summary statistics for small areas.

The North Atlantic census database will provide full geographic detail for every individual in the population. Digitized small-area boundary files are already in preparation for nineteenth-century Norway and Britain, and a pending proposal to the National Science Foundation would provide a similar resource for the United States. Thus, there is already a large scholarly investment in nineteenth-century geographic information systems. What is lacking is a fine level of geographic detail in social, economic, and demographic characteristics. The North Atlantic census database will allow scholars to marry existing geographic boundary files to population characteristics, thus creating a powerful new analytic tool. Such fine geographic analysis will be especially potent in the analysis of topics such as early suburban development and racial and ethnic residential segregation (see for example Gardner 1998).

Substantive Research Areas

A cross-section encompassing the entire population of the North Atlantic world in the late nineteenth century will open up vast new terrain in the fields of history, economics, demography, and sociology. The censuses include a great deal of information on demography and social structure that can be exploited only through the creation of a new microdata set. The late nineteenth century is a critical period in the study of fertility decline, urbanization, international migration, household composition and occupational structure. The database will allow the construction of cross-tabulations on a wide range of topics that were not covered by census publications or were incompletely tabulated. Perhaps even more important is the potential for longitudinal and multilevel multivariate analyses opened by the availability of the database. The North Atlantic census database will not only constitute an invaluable resource in its own right, but will also enhance the value of the previously created historical microdata samples. Used in combination, these microdata will constitute our most important resource for the study of nineteenth-century social structure.

A full discussion of the specific topics that could be addressed with a complete machine-readable database of the nineteenth-century censuses of five countries would require many pages. The paragraphs that follow sketch only a few of the most obvious research applications of the new database.

Industrialization. The first Industrial Revolution may have begun in Lancashire, but by the late nineteenth century, the entire North Atlantic world was involved in manufacturing, the production of raw materials, or both. The North Atlantic database will allow unprecedented opportunities to explore economic structures within and between each nation during this critical transitional period.

For the first time, we will have consistently coded occupational data for multiple countries in the nineteenth century, and the data will be available at the individual level for the entire population. This will allow comparative analysis at the level of persons, families, communities or regions, and investigation of the geographic organization of economic activity. In four of the five countries, for example, mechanized textile manufacturing existed, and the census provides sufficient occupational detail to analyze the organization of the industry in each locality. All five nations were deeply involved in and interconnected by maritime industries. They competed in the rich North Atlantic fishery and in the transatlantic shipping trade. The North Atlantic database will not only reveal the structure of maritime industries, but also will allow the comparative investigation of maritime communities.

Fertility transition. At the time these censuses were taken, each of the North Atlantic countries was just beginning deliberate fertility limitation. The North Atlantic database will allow study of differential fertility patterns in this critical period of demographic transition, and assessment of the importance of such factors as occupational class, ethnicity, region, literacy, local economy, size of locality and family structures. Study of this elemental shift in population structure has the potential to enhance our understanding of ongoing demographic change in the contemporary developing world.

Past comparative analyses of the European fertility transition have relied on aggregate vital statistics (Coale and Watkins 1986). This approach has two major disadvantages. First, aggregate vital statistics do not allow direct measures of child spacing or stopping behavior; only the level of fertility can be considered. Second, the aggregate approach does not allow control of individual-level socioeconomic characteristics.

The new database will allow analysis of fertility differentials through own-child methods (Cho, Retherford and Choe 1986). Own-child methods of fertility analysis require very large datasets, and are therefore especially well suited to complete population databases. Thus, the database will allow a new and more subtle generation of comparative studies of the first demographic transition.
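The core step of an own-child analysis, counting young children who have been matched to their mothers within the household, can be sketched as follows. This is a simplified illustration under stated assumptions: the record layout is hypothetical, and it presumes that each child record already carries a mother_id pointer of the kind a constructed family-relationship variable would supply. Full own-child methods also adjust for child and maternal mortality and for children not living with their mothers, and reverse-survive the counts to estimate age-specific fertility rates; none of that is attempted here.

```python
def own_child_ratio(persons, max_child_age=4, min_age=15, max_age=49):
    """Own children aged 0..max_child_age per woman aged min_age..max_age.

    Each person is a dict with "id", "sex", "age" and, for children who
    were matched to a co-resident mother, "mother_id" (hypothetical layout).
    """
    women = {
        p["id"] for p in persons
        if p["sex"] == "F" and min_age <= p["age"] <= max_age
    }
    if not women:
        raise ValueError("no women in the specified age range")
    own_children = sum(
        1 for p in persons
        if p["age"] <= max_child_age and p.get("mother_id") in women
    )
    return own_children / len(women)


# Invented household records for illustration only.
persons = [
    {"id": 1, "sex": "F", "age": 28},
    {"id": 2, "sex": "F", "age": 35},
    {"id": 3, "sex": "M", "age": 2, "mother_id": 1},
    {"id": 4, "sex": "F", "age": 4, "mother_id": 1},
    {"id": 5, "sex": "M", "age": 0, "mother_id": 2},
    {"id": 6, "sex": "F", "age": 7, "mother_id": 2},  # too old to count
]
print(own_child_ratio(persons))  # 1.5
```

Because the ratio can be computed for any subset of women, the same function could be applied separately by occupational class, ethnicity or locality to compare fertility differentials.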

Household and family composition. For more than a century, political theorists, sociologists and historians have been debating the relationship between industrialization and the family. In the 1970s, a series of British, Canadian and American studies argued that the harsh economic conditions of early industrial capitalism strengthened the interdependence of family members and led to a high frequency of complex households (Anderson 1972; Hareven 1978, 1982; Katz 1975; Foster 1974; Modell 1978). Each of these analyses focused on a single industrializing community, and so was unable to test the proposed association between industrial development and family or household composition.

In recent years, there have been numerous national and regional studies of family composition in the late nineteenth century based on sample data, but few have incorporated community-level economic measures (Sogner 1990, 1998; Gunnlaugsson and Garðarsdóttir 1996; Dillon 1997, 1998, 2000; Ruggles 1994b, 2000; Wall 1995). Comparisons across national boundaries have also been inhibited by inconsistencies in the construction of measures of household composition. Thus, there is presently little agreement about national similarities and differences in family and household composition in the late nineteenth century. Some of the most promising recent work has focused on relatively small population subgroups, such as the living arrangements of the aged or of unmarried mothers of young children, but only the largest samples are capable of supporting such investigations.

The North Atlantic database will include a common set of constructed variables to aid in the analysis of family and household composition and will thus allow consistent comparisons across all five countries. It will allow investigators to assess the impact of local context on family systems through multilevel analysis, and thus for the first time permit analysis of the effects of individual-level factors, local economic conditions, regional inheritance systems, and national characteristics on the nineteenth-century family.

International migration. The late nineteenth century saw international population movements on an unprecedented scale. The massive North Atlantic migration profoundly shaped both the receiving and contributing countries. The great majority of emigrants from Norway, Iceland and Britain went to Canada and the United States, and the influx transformed North American society. Many of these newcomers remained only a few years before returning to their homelands, often bringing home money and always bringing new ideas and experiences (Runblom and Norman 1976; Nugent 1992; Gjerde 1992; Thorvaldsen 1997).

The North Atlantic database will be a wonderful resource for the study of migration history. It will allow close and consistent comparisons of occupational structure, marriage patterns, fertility and family composition. Researchers will be able to identify and compare specific sending and receiving communities. In some instances, it will even be possible to follow individual migrants across the Atlantic and back again. In combination with new machine-readable ship lists and emigration registers, the database will open a new window on the implications of international population flows.

Educational applications of the database

In addition to scholarly research, we anticipate that the new database will make important contributions to teaching in the social sciences, helping to bring the excitement of discovery into the classroom. The detailed geographic analysis made possible by the new database makes it a suitable vehicle for introducing a quantitative dimension into secondary, undergraduate and graduate courses focusing on local history. Once the North Atlantic database is created, we plan to collaborate in the development of web-based instructional materials that capitalize on the fine detail available for local areas and small population subgroups.