Secondary data analysis in educational research: opportunities for PhD students

Abstract. The article discusses the use of secondary data analysis (SDA) in educational research. Definitions of SDA are analyzed; statistics on journal articles that use secondary data analysis in the fields of sociology, social work and education are discussed; the dynamics of articles with data in the Journal of Peace Research from 1988 to 2018 are examined; the papers of the Ukrainian conference “Implementation of European Standards in Ukrainian Educational Research” (2019) are analyzed. The problems of training PhD students to use secondary data analysis in their dissertations are discussed: sources of secondary data in the education field for Ukrainian PhD students are proposed, and a model for training Ukrainian PhD students in secondary data analysis is offered. This model consists of three components: the theory component covers the theoretical basis of secondary data analysis; the practice component contains examples and tasks on using SDA in educational research with statistical software and Internet tools; the third component is support for PhD students while they write their theses.


Introduction
In the modern digital, globalized world, large data flows and large datasets arrive from many different sources. It is therefore important to prepare future researchers for secondary data analysis with new computer tools and technologies. Secondary data are collected by someone other than the researcher and for another purpose. In secondary research, authors may draw data from government documents, scientific papers, statistical databases and other sources.
The relevance of this direction is indicated by a number of initiatives. For example, the Secondary Data Analysis Initiative [1], developed in 2019, aims to deliver high-quality, high-impact research by utilising existing data resources created by the ESRC and other agencies in order to address some of the most pressing challenges facing society.
Secondary data analysis is a promising area in the educational sciences, but it is scarcely represented in PhD research in pedagogy in Ukraine.

Problem definition
The purpose of the article is to establish the features of secondary data analysis in educational research and how it is represented in scientific articles of authoritative journals, conference proceedings and course programmes for PhD students.
J. Sobal discussed the problem of teaching secondary data in the field of sociology [2]. E. Smith analyzed the pros and cons of using secondary data analysis in educational research [3][4]. T. P. Vartanian presented advantages, disadvantages, feasibility, and appropriateness of using secondary data analysis with focus on social work [5].
"Practical Methods for Secondary Data Analysis" course program for students of School of Public Health (University of Minnesota) is presented in [6]. The course emphasizes practical approaches to pre-statistical data processing and analysis with Stata statistical software on a PC with a MS Windows operating system.
T. Logan's recent work on a practical iterative framework for secondary data analysis in educational research deserves attention [7].
V. Sherif discussed the problem of evaluating preexisting qualitative research data for secondary analysis [8]. M. P. Johnston describes secondary data analysis for qualitative and quantitative data in the field of library research [9].
The paper of J. Carter and others [10] focuses on World Bank data and presents the use of socioeconomic secondary data to develop the quantitative skills of social science students at a UK university.
Analysis of scientific sources shows that in Ukraine SDA is not sufficiently used in education in general, and in the training of PhD students in pedagogy in particular.

SDA methodology analysis
What is the definition and essence of secondary data analysis?
J. Sobal notes that any data which have been collected for "another purpose and later reanalysed may be seen as secondary data" [2, p.480]. T. P. Vartanian states that "secondary data can include any data that are examined to answer a research question other than the question for which the data were initially collected" [5].
We agree with E. Smith and others that secondary data analysis is a research methodology with the potential to greatly impact educational research [3]. We also share the opinion of J. Sobal that secondary data analysis, "the reanalysis of machine-readable data, is one of the great supplements to traditional teaching methods, especially for teaching research methodology and statistics" [2]. Training in SDA is especially important for PhD students because they are preparing to become both researchers and university teachers.
There are different ways of using SDA. We can use SDA in isolation, re-assessing a data set with a new research question. Another path is to combine two or more data sets in order to investigate the relation between the variables in those data. We can also combine secondary data analysis with primary data analysis.
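The second path, combining two data sets, can be sketched as follows. This is only a minimal illustration in Python: the two region-level tables, their variable names and all values are invented for the example and do not come from any real source, and the helper names `merge_on_key` and `pearson_r` are hypothetical.

```python
from math import sqrt

# Hypothetical region-level data from two different secondary sources.
# All names and numbers are invented for illustration only.
exam_scores = {"Region A": 148.0, "Region B": 152.5, "Region C": 141.0, "Region D": 156.0}
teachers_per_100_pupils = {"Region A": 8.1, "Region B": 8.9, "Region C": 7.4, "Region E": 9.0}


def merge_on_key(first, second):
    """Inner join: keep only the keys present in both data sets."""
    shared = sorted(set(first) & set(second))
    return [(key, first[key], second[key]) for key in shared]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


merged = merge_on_key(exam_scores, teachers_per_100_pupils)
r = pearson_r([row[1] for row in merged], [row[2] for row in merged])
print(merged)
print(round(r, 3))
```

Regions that appear in only one source are dropped by the inner join, which is itself a decision the researcher should report, since it changes the sample on which the relation is estimated.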
Secondary data can be numeric (quantitative) or non-numeric (qualitative). Qualitative secondary data include data retrieved second hand from interviews, ethnographic accounts, photographs, documents, conversations and other sources.
Sources of numeric or quantitative data that are suited to secondary analysis include: population censuses, government surveys, cohort and other longitudinal studies, administrative records and other regular or continuous surveys, university and college records, author websites and others.
Secondary data can be restricted or public; it can arise from direct (biomarker data) and indirect observation (self-report).
Analysis of scientific sources shows [11] that SDA is a wide field, related to literature search and Internet search, literature review, cross-national research, demographics data, qualitative and quantitative data analysis, comparative research etc. (Fig. 1).
Researchers have presented a wide list of examples of large secondary datasets for educational and social sciences research [12]. According to T. Vartanian, an excellent archive for educational datasets is the International Archive of Educational Data [13]. It offers datasets and online tools for examining a wide range of educational surveys.
We can add some Ukrainian resources to this list. The first one is the Ukrainian Center for Education Quality Assessment. It offers a service for analyzing the results of external independent evaluation (ZNO) by different indicators. Data sets are available for 2015-2019 [14]. Our sociology students used these data in social statistics classes and in course papers to compare the ZNO results of their region with those of another region, of Kyiv, and of Ukraine as a whole.
The second source was presented in our work [15]. We offer our PhD students survey data from Ukrainian teachers [16-17] for analysis. In 2017, the Ukrainian Association of Educational Researchers conducted the All-Ukrainian monitoring "Teaching and Learning Survey on Principals and Teachers of Secondary Education Institutions" (based on the TALIS methodology [18]).
The third source is the population census of Ukraine. We use databases that contain Ukrainian census data since 1959 [19]. For example, one of the tasks involves building and comparing the gender-age pyramids of the population of Ukraine for different years; it includes searching for the relevant data and building the pyramid with standard Excel charting tools, with SPSS tools (Chart Builder, Histogram, Population Pyramid), and with the pyramid package of the R environment. The second task involves calculating the child care and grandparent care load coefficients and visualizing their dynamics, and includes an introduction to the demographic passport of Ukraine [19].
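The second task can be illustrated with a short Python sketch. It assumes that the child care and grandparent care load coefficients correspond to the standard child and old-age dependency ratios, expressed per 1000 persons of working age; the age boundaries (0-14, 15-64, 65+), the function name `dependency_ratios` and the toy population are assumptions made for the example, not census figures or the definitions used in [19].

```python
def dependency_ratios(pop_by_age, child_max=14, elderly_min=65, per=1000):
    """Return (child, elderly) dependency ratios per `per` working-age persons.

    pop_by_age maps a single year of age to the number of people of that age.
    The age boundaries are parameters because national definitions differ.
    """
    children = sum(n for age, n in pop_by_age.items() if age <= child_max)
    elderly = sum(n for age, n in pop_by_age.items() if age >= elderly_min)
    working = sum(n for age, n in pop_by_age.items() if child_max < age < elderly_min)
    if working == 0:
        raise ValueError("no working-age population")
    return per * children / working, per * elderly / working


# Invented toy population: 100 people at every age from 0 to 79.
toy_population = {age: 100 for age in range(80)}
child_ratio, elderly_ratio = dependency_ratios(toy_population)
print(child_ratio, elderly_ratio)
```

Applying the same function to census tables from several years gives the series whose dynamics the students are asked to visualize.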
The disadvantages of using secondary data are: the data may not suit the particular research question; information on study design and data collection procedures may be scarce; the data may lack depth; and the analysis may require knowledge of survey statistics and methods that is not generally provided by basic graduate statistics courses.
Scientists list [20] the following important steps in teaching SDA.
1. Develop the student's research question.
2. Identify a secondary data set.
3. Evaluate the secondary data set:
• What was the aim of the original study?
• Who collected the data?
• Which measures were employed?
• When were the data collected?
• What methodology was used to collect the data?
• Making a final evaluation.
4. Prepare and analyse the secondary data.
It is useful to relate these steps to the three ways of using SDA: in isolation, in a combination of two or more data sets, and together with primary data analysis.
What software is used for SDA? We can use software specifically developed for analysing complex survey data [12]. It is generally free, but may lack flexibility and be useful only for initial data analysis. Examples of such tools are PowerStats (http://nces.ed.gov/datalab/), Data Analysis System (DAS) (http://nces.ed.gov/das/) and AM Statistical Software (http://am.air.org/). We can also use general-purpose software that can account for complex sampling. These tools are usually commercial and expensive (except R). They are generally syntax-based and more flexible. Examples of such tools are SAS (certain analyses require the SUDAAN add-on), Stata, SPSS, Mplus and others.
The R environment has a special package called "survey" [21]. The package is oriented towards the analysis of complex survey samples and provides the following features: summary statistics, two-sample tests, rank tests, generalized linear models, cumulative link models, Cox models, log-linear models, and general maximum pseudo-likelihood estimation for multistage stratified, cluster-sampled, unequally weighted survey samples. It also supports variance estimation by Taylor series linearization or replicate weights, as well as post-stratification, calibration, and raking. Two-phase subsampling designs, graphics, PPS sampling without replacement, principal components and factor analysis are available as well. Students therefore need substantial training to be able to use this package.
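To show why complex samples need more than textbook formulas, the following Python sketch computes a design-weighted mean and its standard error by Taylor series linearization for a stratified cluster sample. It is not the API of the R "survey" package; the function name `weighted_mean_with_se`, the record layout and the data are hypothetical, and the sketch assumes that primary sampling units (PSUs) are drawn with replacement within strata, with no finite population correction.

```python
from collections import defaultdict
from math import sqrt


def weighted_mean_with_se(records):
    """Design-weighted mean and its Taylor-linearized standard error.

    records: iterable of (stratum, psu, weight, y) tuples.
    Assumes PSUs are sampled with replacement within strata and
    ignores finite population corrections.
    """
    records = list(records)
    total_w = sum(w for _, _, w, _ in records)
    mean = sum(w * y for _, _, w, y in records) / total_w

    # Linearized contribution of every observation, summed within each PSU.
    psu_totals = defaultdict(float)
    for stratum, psu, w, y in records:
        psu_totals[(stratum, psu)] += w * (y - mean) / total_w

    # Group the PSU totals by stratum.
    by_stratum = defaultdict(list)
    for (stratum, _), z in psu_totals.items():
        by_stratum[stratum].append(z)

    variance = 0.0
    for zs in by_stratum.values():
        n_h = len(zs)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        z_bar = sum(zs) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in zs)
    return mean, sqrt(variance)


# Invented example: two strata, two PSUs (schools) each, unequal weights.
sample = [
    ("urban", "school_1", 120.0, 151.0), ("urban", "school_1", 120.0, 147.0),
    ("urban", "school_2", 100.0, 160.0), ("urban", "school_2", 100.0, 158.0),
    ("rural", "school_3", 300.0, 139.0), ("rural", "school_3", 300.0, 143.0),
    ("rural", "school_4", 250.0, 150.0), ("rural", "school_4", 250.0, 146.0),
]
mean, se = weighted_mean_with_se(sample)
print(round(mean, 2), round(se, 2))
```

With equal weights, one stratum and every observation in its own PSU, the result reduces to the familiar standard error of a simple random sample, which is a useful check for students before they move on to real survey designs.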
The next section discusses how the application of secondary data analysis is reflected in the articles of scientific journals, and how articles are accompanied by data sets.

Presenting secondary data analysis and quantitative methods in the journal article
The British researcher E. Smith [4] explores the use of quantitative methods in educational research and the use of numeric secondary data analysis.
She reviewed the published output of eight well-regarded journals in the fields of Education, Sociology and Social Work over a seven-year period. About one quarter of all the papers (24%) reviewed by E. Smith used some form of quantitative method; of these, around 42% presented secondary data analysis. The use of quantitative methods ranged from 31% of papers in the 'Education' journals and 27% in the 'Social work' journal to 17% in 'Sociology' (Fig. 2).

Fig. 2. Percent of papers with quantitative methods from total papers. Built by author with data from [4, p.327]
Less than 10% of all papers reviewed involved some analysis of secondary data. In the 'Sociology' journals the majority (75%) of numeric papers did make use of secondary data, including the data from surveys such as the National Child Development Study, the British Family Resources Survey, the Labour Force Study and the European Values Survey. In 'Education' journals, 42% of the papers which used numeric methods involved the analysis of secondary data (Fig. 3).
Fig. 3. Percent of papers with secondary data analysis among papers with quantitative methods. Built by author with data from [4, p.327]
The vast majority of articles made use of school performance data; some other authors used studies such as the Youth Cohort Study, the 1958 British Birth Cohort Study and administrative data produced by the Higher Education Statistics Agency [4].
We performed a secondary statistical analysis of these data. The research question is: "Do the three education journals differ significantly in their use of SDA?" To compare the journals we used Fisher's φ* criterion, which estimates the significance of the difference between the percentages of two samples in which the effect of interest to the researcher is observed. The data for the calculations for two journals are given in Table 2. The empirical value of Fisher's criterion |φ*| is 0.403, which does not exceed the critical value of 1.64, so these journals do not differ significantly in the proportion of articles that use SDA. Similar results were obtained when comparing the other two pairs of education journals.
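The calculation behind this comparison can be sketched in Python. The sketch assumes the usual form of Fisher's angular transformation, φ = 2·arcsin(√p) and φ* = |φ1 − φ2|·√(n1·n2/(n1 + n2)); the function name `fisher_phi_star` and the counts below are hypothetical and are not the values from Table 2.

```python
from math import asin, sqrt


def fisher_phi_star(successes_1, n_1, successes_2, n_2):
    """Fisher's phi* statistic for comparing two sample proportions."""
    phi_1 = 2 * asin(sqrt(successes_1 / n_1))
    phi_2 = 2 * asin(sqrt(successes_2 / n_2))
    return abs(phi_1 - phi_2) * sqrt(n_1 * n_2 / (n_1 + n_2))


CRITICAL_VALUE = 1.64  # threshold quoted in the text

# Hypothetical counts: quantitative papers that use SDA / all quantitative papers.
phi_star = fisher_phi_star(20, 50, 18, 40)
print(round(phi_star, 3), phi_star > CRITICAL_VALUE)
```

If the statistic stays below the critical value, as it does for these invented counts, the difference between the two proportions is not treated as significant, which mirrors the conclusion drawn for the journals above.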
We also analyzed the conference proceedings of UERA (Ukrainian Educational Research Association). The aim of the UERA is to promote the development of the scientific competence of researchers in the field of education and to raise the quality of educational research in order to influence the educational system and society (uera.org.ua). The discussion at the Third UERA Conference "Implementation of European Standards in Ukrainian Educational Research" (June 21, 2019) was held in the following networks: Educational Research Potential for Developing Education in Ukraine; Practical Application of Educational Research for Pre-Service Teacher Training Reform in Ukraine; Academic Integrity and European Ethical Standards in Educational Research [22]. 62 articles were submitted to the conference. Among them, 3 articles contained a secondary data analysis and 14 contained a primary quantitative analysis. Articles with secondary analysis accounted for about 5% of the total number of articles, and articles with quantitative methods for about 23%.

Journal articles with data: Journal of Peace Research
One of the trends in the social and behavioral sciences is support for reproducible research: together with the publication, the author releases the research data, the scripts for processing them, and supporting tools and files. These data can be a useful source for secondary analysis.
Consider the Journal of Peace Research [23] as an example of how to publish reproducible research on peace and conflict. The journal is guided by the principles of data access and research transparency [24], which means that authors, editors, publishers, and professional associations seek to increase the reliability and openness of studies by publishing the authors' data.
We obtained the following statistics about the number of articles with data in 1984-2018 (Table 3).
An analysis of the dynamics of the number of articles with data published in the journal since 1984 (Table 3, Fig. 4) shows that, unlike one article in 1984, readers ...
Consider, for example, one of the 2018 articles, for which the contents of the accompanying files are shown (Fig. 6). The authors provided numerous files in support of their article [25]. They also explained in detail which software is needed to reproduce the calculations and which packages should be installed. R 3.4.1 and Stata 13.1 are required to reproduce the study, and the following R packages need to be installed: "MatchIt", "dplyr", "ggplot2", "haven", "readr", "xtable", "tidyverse", "RStata". The files that accompany the article include .txt and .csv text files; R and Stata scripts; html files; and Stata (.dta) and R (.rda) data files. Thus, the journal promotes the publication of reproducible research, which has the following advantages: increased reliability of research and quality of data analysis; expansion of further research on the topic; and examples and standards for teaching future researchers, in particular in the field of peace and conflict studies [26].

Secondary data analysis courses for master and PhD students
We analyzed courses on SDA for master's and doctoral students. We discuss the following two examples.
The course 'Practical Methods for Secondary Data Analysis' from the School of Public Health (University of Minnesota) emphasizes practical approaches to pre-statistical data processing and analysis with Stata statistical software; the advantages and limitations of several national data resources are also discussed [6]. The course carries three credits. The course goals are:
• Better appreciation of the steps required to take a raw data set and produce an analytic data set.
• Appreciation of the vast number and variety of existing data resources potentially useful in public health investigations, as well as methods for finding and exploiting these data.
• Understanding of basic and moderately advanced data structures, including the binary and hexadecimal number systems, flat files, and relational and hierarchical data resources.
• Ability to read at least moderately complex ASCII data into Stata.
• Ability to exploit existing data conversion software.
• Awareness of central issues in complex sampling designs.
• Familiarity with U.S. Census data.
• Awareness of the National Health Interview Survey.
The second example is the course 'Use of electronic archives of social data' from Taras Shevchenko National University of Kyiv [27]. The purpose of the course is to familiarize students with the possibilities of using electronic archives of social data in research and analytical work, including archives of the results of sociological research (quantitative and qualitative); data from statistical agencies; global (international) indexes and their ratings of countries, cities and regions; data from national and international non-governmental research organizations, etc. The course also aims to teach students to search for data in electronic archives; to acquaint them with the peculiarities of preparing archived data for analysis and with the specifics and methods of secondary data analysis; and to provide basic knowledge of data management planning in empirical sociological projects and of preparing their own research data for deposit in electronic archives of social data.
An analysis of the content of these courses showed that not all topics related to the SDA (Fig. 1) were reflected in their programmes.
In addition, we have not found such courses in master's and doctoral programmes in the pedagogical sciences in Ukraine. It is therefore important to prepare future researchers for secondary data analysis using new computer tools and technologies. This is especially true for PhD students in the field of education, who should be able to search for, analyze and interpret educational statistics in the framework of their dissertations.

Conclusion
The model of such training may consist of three components. The theory component covers the theoretical basis of secondary data analysis and the strengths and weaknesses of this methodology. The practice component contains examples and tasks on using SDA in educational research with computer tools (specialised and general-purpose). These two components are implemented in lectures, seminars and independent work in courses on research methods and on quantitative methods. The third component is support for PhD students in the process of writing their dissertations and includes consultations, seminars and peer reviews.
In our opinion, courses on research methods need to contain a mandatory unit on SDA. The further development of the study is the integration of secondary data analysis into the research methods courses for PhD students in the field of education in Ukraine and the construction of a model of their support at the stage of thesis writing. This model can be structural and content-based [28] or structural and functional.