Validation of Scales for Measuring Factors of Teaching Quality from the Dynamic Model of Educational Effectiveness *

Large-scale educational effectiveness research requires valid student questionnaires to assess teaching practices. This research validated eight scales for measuring teaching factors from the Dynamic Model of Educational Effectiveness (DMEE). Parallel versions of scales for measuring teaching factors in mathematics and biology were constructed and validated in two studies. In the first study, an exploratory factor analysis was conducted on data from 683 students. In the second study, the structure was cross-validated via a confirmatory factor analysis (CFA) on a sample of 5,476 students. The multi-group CFA resulted in acceptable metric invariance for all scales, indicating that the scales have comparable factor loadings. However, unsatisfactory scalar invariance suggested that the scales could not be used to compare teachers of different subjects. Testing alternative structural relations between the teaching factors did not confirm that the data fit the DMEE model adequately, although the fit parameters were better than for the alternative theoretical models. For mathematics, the external validation of the scales showed that the scales correlated with job satisfaction, external control, and teacher self-efficacy reported by the teachers. The scales are reliable and valid and could be applied to different school subjects.


PSIHOLOGIJA, 2022, Vol. 55(2)

Highlights:
• Scales for measuring teaching factors from the Dynamic Model of Educational Effectiveness were developed.
• The scales had a stable factor structure and invariant item loadings across two subjects, mathematics and biology.
• The scales correlated with teachers' job satisfaction, teacher self-efficacy, and teachers' beliefs about students' external control in mathematics.
• The scales could be used for measuring teaching practices in various school subjects.
• Teachers of different school subjects should not be compared via these instruments.
Aside from individual and family factors, the quality of teaching is the most important determinant of student achievement (Brophy & Good, 1986; Creemers & Kyriakides, 2008; Fauth et al., 2014; Hattie, 2009; Klieme, 2012; Sanders & Rivers, 1996; Scheerens, 2000, 2016; Wright et al., 1997). There is a consensus in the literature on the relevant factors of teaching quality, although their nomenclature and operationalization vary somewhat. According to Klieme (2012), teaching quality consists of three factors: structure/classroom management, supportive climate, and challenge/cognitive activation. Muijs et al. (2014) and Kington et al. (2009) offer more specific and numerous aspects of teaching quality, such as structuring classes and materials, providing feedback to students, and proactively maintaining discipline.
One of the widely recognized theoretical models through which teaching quality has been conceptualized is the Dynamic Model of Educational Effectiveness (DMEE; Creemers & Kyriakides, 2008). The model represents an amalgam of theory and empirical research on the factors that affect students' learning and achievement. It has been empirically tested in several studies (Creemers & Kyriakides, 2015; Kyriakides et al., 2020). According to the DMEE, the determinants of student achievement come from different hierarchical levels: the student, teacher, school, and system levels. Some of the factors at the student level are socio-economic status, motivation, high expectations of students, intellectual abilities, and perseverance. School-level factors (e.g., cooperation between teachers in school) and system-level factors (e.g., a quality national curriculum) do not impact students directly, but enable lower-level factors to work well. For example, improving cooperation between teachers may improve an individual teacher's ability to better engage students in the classroom, which will impact student achievement. Within the education system, teaching factors affect students most.
The eight teaching factors included in the DMEE describe the following aspects of teacher work (Creemers & Kyriakides, 2008): Orientation refers to emphasizing the relevance and purpose of teaching content and activities in the context of students' knowledge, everyday application, and scientific knowledge; Modelling relates to developing students' strategies for solving difficult problems and for evaluating and organizing their own learning; Application implies exercising the taught content and applying it in different situations; Questioning refers to actively engaging students by asking them various questions, seeking argumentation, and so on; Assessment encompasses identifying students' needs and offering constructive feedback, as well as correcting the teacher's own work; Structuring refers to active and clear teaching and the creation of well-structured and organized lessons, for example, by presenting the outline and repeating the most important points at the end of the class, as well as positioning lesson content within the wider context of students' knowledge; Classroom as a Learning Environment refers to creating a positive and supportive climate in the classroom; Management of Time refers to the teacher's use of classroom management procedures to maximize students' time on task and to create an effective learning-oriented classroom without distractions.
The teacher level of Creemers and Kyriakides' DMEE (2008) corresponds well to another widely accepted model of educational effectiveness: Klieme's (2012) previously mentioned model of three overarching factors of teaching quality. Klieme's model represents a theoretical basis for the student questionnaires used in PISA international testing (OECD, 2019). According to Klieme (2012), the DMEE's Management of Time belongs to the Classroom Management part of the Structure/Classroom Management factor, while Structuring, Orientation, and Assessment belong to the Structure part of the Structure/Classroom Management factor. The DMEE's Classroom as a Learning Environment corresponds to the Supportive Climate factor, while the DMEE's Modelling, Application, and Questioning correspond to the Challenge/Cognitive Activation factor.
In order to examine teaching quality and its effects, it is necessary to adequately assess teaching practices. This is commonly done by surveying students or teachers. Having a valid and reliable instrument that assesses all relevant teaching practices that impact student achievement, and that can be applied to various school subjects, is invaluable to researchers, but especially to practitioners and policy makers. Around two thirds of European countries require schools to undertake self-evaluation (European Commission/EACEA/Eurydice, 2015). Together with external evaluation, self-evaluation is seen as an important instrument for school development (European Commission, 2018). However, it is unrealistic to expect all schools to be thoroughly familiar with the latest literature on quality teaching practices and to possess the expertise to develop high-quality questionnaires to assess them. Furthermore, policymakers can make important decisions in the areas of teachers' professional development and accountability mechanisms based on the results of educational effectiveness studies that use such questionnaires. That is why researchers should provide valid and reliable instruments that are both theory- and evidence-based, as well as comprehensive and practical. The literature has indicated that the most common problems with surveying include socially desirable answers of teachers and the incompetence of younger students to evaluate teaching (Nielsen & Gustafsson, 2016), along with teachers' popularity with younger students, which positively biases students' assessments (Fauth et al., 2014). However, even though individual responses of students can be unreliable, classroom aggregates of their responses are valid and reliable (Kane & Cantrell, 2010, 2012; Nielsen & Gustafsson, 2016). Younger students, even third graders, can also give valid and reliable assessments of teaching quality (Fauth et al., 2014).
A combination of observations, students' assessments, and previously determined teacher's value added 1 has proved to best predict teacher effectiveness. Since each measure has its pros and cons, combining them gives the best results (Kane & Cantrell, 2010, 2012). This research, focusing on scales for measuring teaching quality, represents a part of a large-scale study. In the larger study, eighth-grade students in Serbia were surveyed and their responses were aggregated to the teacher level in order to adequately assess the teaching practices of their mathematics and biology teachers. The goal was to evaluate the quality of teaching in schools in Serbia and then identify its effects on student achievement on the final exam and student interest in these subjects. The findings on the effects of teaching practices on student achievement and interest in mathematics and biology are presented in another paper (Teodorović et al., 2021).
The aim of this research was to validate eight scales for measuring teaching factors from the DMEE. For this purpose, two studies were conducted. In the first study, a preliminary version of the eight scales was explored. The second study cross-validated the scales' structure, tested their theoretical correspondence with the DMEE, and examined their external validity by exploring their relations with relevant teachers' psychological variables. Since the aim was to create scales that would be generic in nature and applicable to different school subjects, in this study, we validated scales referring to two subjects: mathematics and biology 2.

Study 1
The objectives of this study were the initial construction of scales for measuring the teaching factors from the DMEE and the exploration of their latent structure.
1 Teacher's value added has been determined by adjusting their students' achievement gains for student characteristics such as prior performance and demographics (Kane & Cantrell, 2010, 2012).
2 These subjects were selected because they were the only two subjects that were taught in all four grades of lower secondary education (ISCED 2) in Serbia and that had adequate measures on both the TIMSS 2011 international testing of 4th graders in Serbia and on the final exams of 8th graders in Serbia in 2015, which were the requirements for our larger study. Having two achievement measures for the same group of students, one when they were in the 4th grade and another when they were in the 8th grade, was necessary for the larger study, in which we tried to capture the effect of four years of accumulated teaching practices on student achievement.

Sample and Procedure
The sample included a total of 683 seventh-grade students from 16 primary schools (out of 20 that were contacted) in nine cities in Serbia (Belgrade, Ub, Kragujevac, Kruševac, Sokobanja, Novi Sad, Sombor, Sopot, and Jagodina). In 13 schools, two classes were sampled per school, and in three schools, one class was sampled per school. The sample of schools and classes was a convenience sample. Within each class, one half of the students filled out a questionnaire assessing the teaching of mathematics (n = 346, 50.7%) and the other half assessed the teaching of biology (n = 337, 49.3%). Students completed the questionnaires during one school lesson in October 2014. In this study, we did not gather any socio-demographic information from respondents.
At the time when this study was performed (in 2014), there were no Institutional Review Boards (IRBs) and their approval was not obligatory, nor was such an approval requested by the project funders and schools. The whole research project (including Study 1 and Study 2) was approved by the Ministry of Education of the Republic of Serbia, and the study was carried out with the support and approval of school authorities. The project was implemented in accordance with the Law on Personal Data Protection and the best practices at the time.

Instruments
In the process of constructing the preliminary versions of the scales, the existing instruments for the assessment of the DMEE were examined (Creemers et al., 2012) and some items were modified to fit student age or the specific context of the school environment in Serbia. Then, the existing items from the PISA study (OECD, 2013) and the DAQS database (Datenbank zur Qualität von Schule, n.d.) were analyzed, and the scales/items that corresponded to the theoretical content of factors from the DMEE were added to the item pool. Finally, the researchers constructed new items to better cover the theoretical content of the teaching factors. This process resulted in a total of 70 items distributed across eight scales. A smaller number of the constructed items were reverse coded. Students assessed the frequency of a certain practice during class on a 4-point Likert scale (1 = never, 4 = always or almost always).

Statistical Analyses
An exploratory factor analysis (EFA) was conducted to investigate the structure of each individual scale. We applied the principal components method. To determine the optimal number of factors, we consulted two criteria: the Guttman-Kaiser criterion and a parallel analysis with the 95th percentile of randomly generated eigenvalues. The criteria for item retention were a factor loading ≥ .50 and no cross-loadings. The factors were rotated via the promax rotation. The EFA was applied iteratively until all items met the criteria. The EFA was carried out separately for the scales referring to mathematics and biology, and only the items that fulfilled the described criteria for both subjects were kept. After the final versions of the scales were determined, we applied an EFA with the same criteria to the whole item pool to investigate whether the latent structure would correspond to the theoretical factors from the DMEE. For the final solution of eight scales, we calculated descriptive statistics, internal consistency (Cronbach's alpha), and mutual intercorrelations (Pearson correlation coefficients). All analyses were carried out in SPSS 20.0 software.
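The parallel analysis criterion described above can be sketched as follows. This is a minimal illustration under our own assumptions (function name and defaults are ours, not the SPSS procedure used in the study): a component is counted as meaningful when its observed eigenvalue exceeds the 95th percentile of eigenvalues obtained from random data of the same dimensions.

```python
import numpy as np

def parallel_analysis(data, n_iter=100, percentile=95, seed=0):
    """Horn's parallel analysis: retain components whose observed
    eigenvalues exceed the chosen percentile of eigenvalues computed
    from random normal data of the same shape."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    # Eigenvalues of the observed correlation matrix, descending.
    obs = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    rand = np.empty((n_iter, p))
    for i in range(n_iter):
        sim = rng.standard_normal((n, p))
        rand[i] = np.linalg.eigvalsh(np.corrcoef(sim, rowvar=False))[::-1]
    threshold = np.percentile(rand, percentile, axis=0)
    # Count the components whose eigenvalue beats the random benchmark.
    return int(np.sum(obs > threshold))
```

With items that all load on a single latent factor, only the first observed eigenvalue should exceed its random benchmark, so the function suggests one component.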

Exploratory Factor Analysis
The Kaiser-Meyer-Olkin coefficient of sampling adequacy indicated that it was justified to apply an EFA on all scales (see Supplementary Material 1, Appendix A). The EFA resulted in five unidimensional scales: Orientation (59% of the variance was explained by the first principal component in mathematics and 55% in biology), Modelling (59% in mathematics and 60% in biology), Application (59% in mathematics and 57% in biology), Questioning (62% in mathematics and 54% in biology), and Assessment (56% in mathematics and 53% in biology). Detailed results of the EFA, with the structure and the factor loadings for all scales, are presented in Supplementary Material 1, Appendix A (https://osf.io/acfpu/). For the Structuring scale, the Guttman-Kaiser criterion resulted in two factors for both subjects, while the parallel analysis recommended a one-factor structure for mathematics and a two-factor structure for biology. Since we aimed to create a scale with the same structure for both subjects, we opted for the two-factor solution 3. The two-component structure explained 59% of the variance in mathematics and 56% in biology. The first component referred to the teaching practices aimed at connecting the material with different lessons, different subjects, or students' out-of-class knowledge. This component was named Connecting. The second component described clear and understandable teaching and the setting of clear learning goals. This component was named Clarity. The correlation of the extracted components was .59 for mathematics and .43 for biology.
Initially, a two-component structure was also obtained for the Classroom as a Learning Environment scale, explaining 51% of the variance in mathematics and 49% in biology. The first component described the relationship between the teacher and the students, with the teacher actively providing help and encouragement, while the second component described students' mutual relationships (see Supplementary Material 2, Appendix A, Table 14 for the initial version of the EFA; https://osf.io/umcg5/). The extracted components did not correlate with each other (r = .01 for mathematics and r = .16 for biology), suggesting that these contents do not belong to the same conceptual space. Since the second component did not relate directly to the teacher's actions, we decided to exclude these items from the scale. The EFA was re-applied, and the repeated analyses resulted in a unidimensional structure for both subjects. The first component explained 54% of the variance of items in mathematics and 43% in biology.
For the Management of Time scale, the Guttman-Kaiser criterion indicated it was optimal to keep three factors, while parallel analyses suggested two factors for mathematics and three for biology. Since our aim was to obtain corresponding structures for both subjects, we leaned on the Guttman-Kaiser criterion (see the other solution in Supplementary Material 2, Appendix A, Table 16). The three-dimensional structure explained 67% of the variance in mathematics and 64% in biology. The three components corresponded to the three aspects of this construct: Loss of Time, Classroom Disorder, and Teacher's Classroom Management. The three components' correlations varied in absolute values from .27 to .47 for mathematics and from .40 to .50 for biology (in the theoretically expected direction).

Table 1 shows the number of items before and after the EFA for all scales, as well as the internal consistency (Cronbach's alpha) of the final versions of the scales, which ranged from good to excellent. Only the Loss of Time subscale from the Management of Time scale had suboptimal internal consistency when applied to biology. Table 2 shows the intercorrelations of the teaching factors. All correlations were moderate to high and positive.

To determine whether the scales' items would group in line with the theoretical model of the DMEE, we applied a joint EFA on the pool of 59 items from the final versions of the scales. The Guttman-Kaiser criterion recommended eight components for mathematics and eleven components for biology, while the parallel analysis suggested that the optimal number of components was three for mathematics and four for biology (see Supplementary Material 1, Appendix B for a detailed presentation of the results). Since the Guttman-Kaiser criterion tends to overestimate the number of factors, we opted for the factor solution suggested by the parallel analysis 4. Neither the number of components nor their structure corresponded to those from the DMEE.
A simplified presentation of the structure of these factors is shown in Table 3. For mathematics, the first component predominantly gathered indicators of Orientation, Assessment, Modelling, and Structuring-Connecting; the second component gathered indicators of Classroom as a Learning Environment, Management of Time (Loss of Time), Structuring-Clarity, Application, and Questioning; while the third component included two subscales of Management of Time (Classroom Disorder and Classroom Management). For biology, the first component was saturated with indicators of Modelling, Assessment, and Application; the second component was saturated with indicators of Classroom as a Learning Environment, Questioning, and Structuring-Clarity; the third included items from the three subscales of Management of Time; and the fourth gathered Structuring-Connecting and Orientation. The results of this study will be discussed together with the results of the second study in the General Discussion section.

Study 2

In the second study, we aimed to: a) cross-validate the individual scale structures on a new, larger sample of students, b) test which theoretical model best represents the overall structure and relations of the scales, and c) externally validate the final versions of the scales by examining their relations with specific teacher psychological variables (job satisfaction, teacher self-efficacy, and external control) which are associated with teaching practices (Gkolia, Belias, & Koustelios, 2014; Klusmann et al., 2008; Kunter et al., 2013; Rissanen et al., 2018; Rose & Medway, 1981).

Sample and Procedure
The sample included students from 125 schools in Serbia 5. In 115 schools, two 8th grade classes were sampled per school, while one 8th grade class was chosen from each of the remaining 10 schools, resulting in 240 classes attended by a total of 5,476 eighth-graders. In April 2015, one half of the students within each class evaluated the teaching of mathematics and the other half assessed the teaching of biology. In 33 classes, there were fewer than 20 students, so all those students assessed only the teaching of mathematics. Therefore, 2,895 students (53.4%) assessed mathematics classes and 2,527 (46.6%) assessed biology classes. The sample included 48.2% males and 51.2% females, while 37 participants (0.7%) did not indicate their gender. The average age was 14.54 years (SD = 0.33, range 13.16-17.66).
A sample of teachers was also included in the study. Out of 2,401 teachers of various subjects in the large study (20 from each school, with a certain dropout), there were 164 mathematics teachers (69.5% female; average work experience 14.1 years, SD = 10.48) and 135 biology teachers (85.9% female; average work experience 18.4 years, SD = 9.67) who could be paired with their students who filled out the questionnaires on teaching. An external validation of the scales for measuring teaching factors was performed on this sample of teachers.
As mentioned in Study 1, ethical approval was not acquired for this study since there were no institutional IRBs at the time when it was performed (in 2014 and 2015) and their approval was not obligatory. Since we gathered more data about participants in this study, in accordance with the Law on Personal Data Protection and the best practices in force at the time when the project was implemented, all personal information about students, such as questions regarding house possessions (not reported in this paper), was gathered from students' parents, who were informed about their children's participation in the study. Teachers who participated in the study were informed about the project aims and consented to participate. Children, parents, and teachers participated anonymously, and their data were paired through a system of codes.

Instruments
In Study 2, students filled out questionnaires for measuring teaching factors from the DMEE that resulted from the EFA of Study 1. Three teacher psychological variables were selected in order to determine the external validity of the scales.
5 The results reported in this paper are a part of a larger project, which utilized the nationally representative sample of 156 schools from the TIMSS 2011 study. Out of 156 schools, 27 schools were excluded because of the small number of students, which made subsamples from these schools inadequate for the design of the planned project. Another four schools refused to participate in the study. More information about this project and the sample is presented in the paper by Teodorović et al. (2021).

Job satisfaction scale (TIMSS, 2009) contains six items with a four-point Likert response scale and has a satisfactory internal consistency (α = .76).
External control scale (Skaalvik & Skaalvik, 2007) implicitly measures teachers' beliefs about their influence on students' academic success. The scale contains five items with a four-point Likert response scale and has a satisfactory internal consistency (α = .72). A higher score on the scale indicates the teacher's belief that students' success is primarily determined by their abilities and family environment, and not by the teacher.
Self-efficacy scale (Tschannen-Moran & Woolfolk Hoy, 2001) measures teachers' perception of their performance in a variety of teaching activities. The scale has three subscales: Efficacy in Instructional Strategies, Efficacy in Classroom Management, and Efficacy in Engaging Students. Each subscale is operationalized by four items with a five-point Likert response scale. The reliability of each subscale is good (from .75 to .89), with α = .89 for the entire scale.

Statistical Analyses
A confirmatory factor analysis (CFA) was conducted to test the scale models derived from the EFA. The CFA was applied separately to the data obtained for mathematics and biology. The following criteria were applied: a) the corrected chi-square test, which is more tolerant to sample size than a simple chi-square and whose value should be under 5 (Mueller, 1996); b) the root mean square error of approximation (RMSEA) and the standardized root mean square residual (SRMR), with values < .05 indicating excellent fit and values < .08 indicating acceptable fit (Byrne, 2010); c) the comparative fit index (CFI) and the normed fit index (NFI), with values > .90 indicating acceptable fit and values > .95 indicating excellent fit (Kline, 2005). If the model parameters were below acceptable levels, modification indices were analyzed and up to two corrections were made (Byrne, 2010; Kline, 2005).
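For illustration only (the helper name and return labels are ours), the cutoffs listed above can be collected into a single check:

```python
def fit_is_acceptable(chi2_corr, rmsea, srmr, cfi, nfi):
    """Classify a CFA solution against the cutoffs used in this study:
    corrected chi-square < 5; RMSEA and SRMR < .08 (acceptable) or
    < .05 (excellent); CFI and NFI > .90 (acceptable) or > .95
    (excellent). Returns 'excellent', 'acceptable', or 'poor'."""
    if chi2_corr >= 5 or rmsea >= .08 or srmr >= .08 or cfi <= .90 or nfi <= .90:
        return "poor"
    if rmsea < .05 and srmr < .05 and cfi > .95 and nfi > .95:
        return "excellent"
    return "acceptable"
```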
Since the aim of this study was to test the stability of the theoretical model independently of the subject, a multi-group CFA was applied to each scale by examining the configural, metric, and scalar invariance, with the subject (mathematics/biology) as the grouping factor. Configural invariance implies that items have significant loadings on the associated dimensions. Metric invariance assumes the invariance of item loadings across the compared groups, while scalar invariance tests the invariance of intercepts. Comparisons between the invariance models are performed by calculating the differences in their fit parameters (ΔCFI, ΔRMSEA, and ΔSRMR), where the differences should not be greater than .01 (values > .01 are considered significant at the p < .01 level; Chen, 2007). The lack of statistically significant differences between the models indicates that the more constrained model fits the data equally well.
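The comparison rule can be sketched as follows. This is an illustrative helper (the name and dictionary layout are ours), applying the .01 change criterion from Chen (2007) as used in this study:

```python
def invariance_step_holds(fit_less, fit_more, cutoff=0.01):
    """Decide whether a more constrained invariance model (e.g., metric)
    fits as well as a less constrained one (e.g., configural): CFI may
    not drop, and RMSEA/SRMR may not rise, by more than `cutoff`.
    `fit_less` and `fit_more` are dicts with keys CFI, RMSEA, SRMR."""
    d_cfi = fit_less["CFI"] - fit_more["CFI"]
    d_rmsea = fit_more["RMSEA"] - fit_less["RMSEA"]
    d_srmr = fit_more["SRMR"] - fit_less["SRMR"]
    return d_cfi <= cutoff and d_rmsea <= cutoff and d_srmr <= cutoff
```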
Additionally, we used Tucker's congruence coefficient to verify the congruence between the factor structures for mathematics and biology (Harman, 1976; Tucker, 1951). This statistic is used in large datasets because the standard CFA is sensitive to sample size and thus can easily dismiss the assumption of equal factor structures across groups (Lorenzo-Seva & Ferrando, 2003). Values between .98 and 1.00 imply excellent, .92-.98 good, .82-.92 marginal, .68-.82 weak, and < .68 poor congruence (MacCallum et al., 1999).
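Tucker's congruence coefficient for two factor-loading vectors x and y is φ = Σxᵢyᵢ / √(Σxᵢ² · Σyᵢ²); a minimal sketch (function name ours):

```python
import numpy as np

def tucker_congruence(x, y):
    """Tucker's congruence coefficient between two factor-loading
    vectors: phi = sum(x*y) / sqrt(sum(x^2) * sum(y^2))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2)))
```

Identical (or proportional) loading patterns yield φ = 1, while orthogonal patterns yield φ = 0.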
After the final versions of the eight scales were established, their internal consistency (Cronbach's alpha) and descriptive data were calculated.
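Cronbach's alpha can be computed from the item variances and the variance of the total score, α = k/(k−1) · (1 − Σs²ᵢ/s²ₜ); an illustrative sketch (function name ours):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents-by-items score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return float(k / (k - 1) * (1 - item_vars / total_var))
```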
In order to investigate whether the DMEE theoretical model best fits the data, we conducted a CFA (with the same criteria defined above) and contrasted it with the structure that corresponds to Klieme's latent-factor model (Klieme, 2012) and the structure that corresponds to a one-factor model. The DMEE model was defined by the eight factors based on the constructed scales, which were mutually correlated (Model 1). In Klieme's model, the items from the DMEE scales were grouped so that all items from Modelling, Application, and Questioning loaded on the Challenge/Cognitive Activation factor; Classroom as a Learning Environment corresponded to the Supportive Climate factor; items from the Structuring, Orientation, and Assessment scales loaded on the Structure factor; and Management of Time corresponded to the Classroom Management 6 factor (Model 2). In the one-factor model, all items loaded on one factor (Model 3).
Finally, in order to examine the external validity of the scales, student evaluations of the eight teaching factors were aggregated at the teacher level by calculating the average value of students' responses. These values represented teachers' scores on the teaching factors from the DMEE. They were then correlated, using the Pearson correlation, with the variables obtained from teachers' self-assessments 7.
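The aggregation-and-correlation step can be sketched as follows (an illustrative helper under our own assumptions about the data layout, not the exact workflow used in the study):

```python
from collections import defaultdict
import numpy as np

def teacher_level_correlation(records, teacher_var):
    """Aggregate student ratings to the teacher level (class mean) and
    correlate the aggregates with a teacher self-report variable.
    records: iterable of (teacher_id, student_rating) pairs;
    teacher_var: dict mapping teacher_id -> self-reported score."""
    by_teacher = defaultdict(list)
    for tid, rating in records:
        by_teacher[tid].append(rating)
    # Keep only teachers present in both data sources, in a fixed order.
    tids = sorted(set(by_teacher) & set(teacher_var))
    means = np.array([np.mean(by_teacher[t]) for t in tids])
    ext = np.array([teacher_var[t] for t in tids])
    return float(np.corrcoef(means, ext)[0, 1])
```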
All analyses were performed in SPSS 20.0 and AMOS 21.0 software.

Confirmatory Factor Analysis
The majority of the scales had adequate fit parameters in the CFA (Table 4), with the exception of the corrected chi-square. Fit parameters were below the satisfactory values for the Orientation, Application, and Assessment scales, so modifications were introduced. In each of the three scales, the errors between two items were correlated (see Figure 1) and an acceptable fit was achieved. These modifications suggest that the pairs of items for which the error correlations were introduced share similarities beyond the main subject of measurement of the scale; in this case, both items refer to stressing the importance of the teaching content or working through the most common students' mistakes. The parameters of these modified versions of the model are shown in Table 4. Except for these modifications, for all scales, the structure obtained in the EFA in Study 1 was confirmed in the CFA on the data from Study 2.

6 Although Klieme's model theoretically includes three teaching factors, he sometimes divides his Structure/Classroom Management factor into two: Structuring and Classroom Management. In the EFA performed on all items from the eight scales constructed in Study 1, Management of Time emerged as an individual component. Therefore, we decided to lean on Klieme's model with separate Structuring and Classroom Management, which might better correspond to the data from our study and could be a more appropriate alternative to the DMEE.
7 We first planned to examine the unique contribution of the individual teaching factors from the DMEE to these external criteria by conducting regression analyses with the teacher psychological variables as criteria and the teaching factors as predictors. However, multicollinearity diagnostics indicated that collinearity was a problem (the highest VIF was 12.11 for mathematics and 11.24 for biology, and values higher than 10 are usually taken as critical). Therefore, we decided to only carry out the Pearson correlations between these sets of variables.

The multi-group CFA was performed next, and its results are shown in Supplementary Material 1, Appendix C (https://osf.io/acfpu/). The values of the corrected chi-square test were unsatisfactory for all scales, but this is not uncommon when dealing with particularly large samples. As for the other fit parameters, for all scales, both the configural and metric invariance models had satisfactory CFI, RMSEA, and SRMR. The differences between them were not statistically significant, indicating invariance of factor loadings. For Management of Time, Classroom as a Learning Environment, and partly Structuring (except for the CFI), the model assuming scalar invariance was also acceptable and not significantly different from the model with metric invariance, suggesting equal intercepts in the two subjects. For all other scales, the fit indices of the scalar invariance models were not acceptable, indicating different intercepts when these scales were used for different subjects. Tucker's congruence coefficient, however, showed excellent congruence for all scales, as its values were all .99.
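The multicollinearity screen mentioned in footnote 7 can be illustrated with a simple VIF computation (a sketch, function name ours): VIF_j = 1/(1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors, and values above roughly 10 are usually taken as critical.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each predictor column of X."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # OLS of column j on the remaining columns (with intercept).
        A = np.column_stack([np.ones(len(y)), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out
```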
The structure of all eight tested models is presented in Figure 1. In summary, the fit indices (except for the corrected chi-square) for all scales suggest that the factor structures of the scales are comparable for mathematics and biology. Specifically, item loadings on the latent factors were invariant, but the intercepts generally differed. This means that it is justified to use these scales to compare teachers of the same subject, but not teachers of different subjects.

Descriptive Statistics and Reliability of Scales
After determining the final structures of the scales, scale scores were calculated as the average of the responses to the corresponding items. Standardized skewness and kurtosis, as well as the Kolmogorov-Smirnov test of normality, indicated that all variables leaned towards positive scores, i.e., that students' evaluations of the teaching factors were positively biased (Table 5). The internal consistency of the scales (Cronbach's alpha) ranged from good to excellent.
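Cronbach's alpha, the internal-consistency index reported here, is computed as α = k/(k-1) · (1 − Σ item variances / variance of the total score). A small sketch with simulated item responses (the data are illustrative only):

```python
import numpy as np

def cronbach_alpha(responses):
    """Cronbach's alpha for an (n_respondents, k_items) response matrix.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    X = np.asarray(responses, dtype=float)
    k = X.shape[1]
    item_variances = X.var(axis=0, ddof=1).sum()
    total_variance = X.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# simulate five items driven by one latent trait plus item-specific noise
rng = np.random.default_rng(1)
latent = rng.normal(size=300)
items = np.column_stack(
    [latent + rng.normal(scale=0.6, size=300) for _ in range(5)]
)
alpha = cronbach_alpha(items)
print(round(alpha, 2))  # items sharing one latent trait yield a high alpha
```

Conventional rules of thumb read values around .80 as good and above .90 as excellent, which is the sense in which the scales' reliabilities are described.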
Descriptive statistics for the teacher psychological variables (job satisfaction, external control, and teacher self-efficacy) are presented in Table 6. Legend: K-S test = Kolmogorov-Smirnov normality test.
Note. The values in the columns titled Stand. Skewness and Stand. Kurtosis were calculated by dividing the skewness and excess kurtosis values by their standard errors; we refer to these as standardized skewness and standardized kurtosis, respectively. Values beyond ±1.96 or ±2.56 mean that the sample skewness/kurtosis falls outside the 95% or 99% confidence interval formed around the value of 0, indicating a departure from a mesokurtic, symmetric distribution.

The results of the CFA for the different theoretical models (the eight-factor model from the DMEE, Model 1; Klieme's latent factor model, Model 2; and the one-factor model, Model 3) showed that only Model 1 had satisfactory RMSEA and SRMR (Supplementary Material 1, Appendix D). However, none of them had satisfactory χ²/df, CFI, NFI, and NNFI, suggesting that none of these theoretical models described the data optimally. Additionally, we contrasted these models against each other, and the results showed that Model 1 was significantly better than Models 2 and 3 (Model 2 vs. Model 1: Δχ²(22) = 2826.93, p < .001 for mathematics and Δχ²(22) = 3063.74, p < .001 for biology; Model 3 vs. Model 1: Δχ²(28) = 12394.37, p < .001 for mathematics and Δχ²(28) = 12230.85, p < .001 for biology; Model 3 vs. Model 2: Δχ²(6) = 9567.44, p < .001 for mathematics and Δχ²(6) = 9167.11, p < .001 for biology).
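The standardized skewness and kurtosis described in the note can be sketched as follows. The standard-error formulas below are the conventional large-sample expressions (as used, e.g., by SPSS), and the data are simulated purely for illustration:

```python
import numpy as np

def standardized_skew_kurt(x):
    """Skewness and excess kurtosis divided by their standard errors.

    Conventional large-sample standard errors:
    SE_skew = sqrt(6n(n-1) / ((n-2)(n+1)(n+3)))
    SE_kurt = 2 * SE_skew * sqrt((n^2 - 1) / ((n-3)(n+5)))
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    s2 = (d ** 2).mean()
    skew = (d ** 3).mean() / s2 ** 1.5
    kurt = (d ** 4).mean() / s2 ** 2 - 3.0  # excess kurtosis
    se_skew = np.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2.0 * se_skew * np.sqrt((n ** 2 - 1) / ((n - 3) * (n + 5)))
    return skew / se_skew, kurt / se_kurt

# a strongly right-skewed sample: both statistics exceed the 1.96 cutoff
rng = np.random.default_rng(2)
z_skew, z_kurt = standardized_skew_kurt(rng.exponential(size=500))
print(z_skew > 1.96, z_kurt > 1.96)
```

Values inside ±1.96 would be consistent with a symmetric, mesokurtic distribution at the 95% level, mirroring the interpretation given in the note.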

External Validity of Scales
In order to examine the external validity of the scales, aggregated teacher scores on the teaching factors were correlated with the measures of teachers' job satisfaction, external control, and self-efficacy. The results are presented in Table 7.
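The aggregation-then-correlation step can be sketched as follows; the teacher IDs, student ratings, and job-satisfaction scores below are invented for illustration and do not come from the study:

```python
import numpy as np

# hypothetical data: students rate their own teacher on a teaching factor;
# each teacher reports one job-satisfaction score
teacher_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
student_ratings = np.array([4.0, 3.5, 4.5, 2.0, 2.5, 3.0, 3.5, 2.5, 3.0])
job_satisfaction = np.array([4.0, 2.5, 2.8])  # one value per teacher

# aggregate student ratings to the teacher level (mean rating per teacher)
teacher_means = np.array(
    [student_ratings[teacher_ids == t].mean() for t in np.unique(teacher_ids)]
)

# Pearson correlation between aggregated ratings and the teacher-level variable
r = np.corrcoef(teacher_means, job_satisfaction)[0, 1]
print(round(r, 2))
```

Aggregating to the teacher level is necessary because the teaching-factor scores come from many students per teacher, while each psychological variable has a single value per teacher.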

General Discussion
The results of the exploratory and confirmatory factor analyses showed that each individual scale for the measurement of teaching factors of the DMEE had a clear and theoretically adequate structure and good internal consistency. Six scales (Orientation, Modelling, Application, Questioning, Assessment, and Classroom as a Learning Environment) had a unidimensional structure, while two scales (Structuring and Management of Time) had a multi-dimensional structure. Two aspects of the Structuring scale, Clarity and Connecting, were interrelated and formed a single scale with good internal consistency. Of the three subscales of the Management of Time scale (Loss of Time, Classroom Disorder, and Classroom Management), the first two correlated positively with each other and negatively with the third, which is in line with theoretical expectations. The overall scale had good internal consistency. It should be noted that all reverse-coded items were eliminated from the scales due to unsatisfactory psychometric characteristics, and that the final version of the scales contains only positively formulated items. The scales in Serbian and English are publicly available on the OSF page of the project (https://osf.io/q85zu/).
The intercorrelations of the scales were moderate to high. While these correlations may seem higher than would be desirable in terms of the discriminant validity of the scales, one should bear in mind two things. First, research has shown that quality teaching practices go hand in hand, i.e., that a good teacher does many things well, which explains the high correlations (Muijs & Reynolds, 2000; Teodorović, 2011). Indeed, one of the main assumptions of the Dynamic Model of Educational Effectiveness is that teacher factors are interrelated (Kyriakides et al., 2009). Second, student assessment of teaching practices may suffer somewhat from the halo effect, i.e., it is possible that student responses were partly influenced by general impressions of the teacher. The fact that students' assessments of the teaching factors were skewed towards positive values on all scales may suggest that students were slightly biased toward their own teachers or that teachers on average exhibited quality teaching. However, this does not mean that student assessment does not have its place in research, as it has been established that students can give reliable and valid assessments of teaching practice even in younger grades (Fauth et al., 2014; Kyriakides et al., 2014). After all, teachers' self-assessment measures are also influenced by personal biases. Nevertheless, the high intercorrelations of the scales may suggest that at least some of the teacher factors from the DMEE are not separate theoretical constructs.
Taking into account the mutual correlations of the scales, we decided to perform a joint EFA on all items from the final versions of the eight scales; the results did not converge to the DMEE or to alternative theoretical models such as Klieme's model (Klieme, 2012). The analyses revealed three general factors for mathematics and four for biology, meaning that the structure was not equivalent for the two subjects. There were, however, certain similarities between the two obtained solutions. Management of Time converged into one factor, although for mathematics it did not include the Loss of Time subscale. Classroom as a Learning Environment went along with Structuring-Clarity in both subjects, although it was accompanied by other theoretical concepts in mathematics. The factor that gathered Orientation, Structuring-Connecting, Modelling, and Assessment in mathematics separated into two in biology: one that joined Orientation and Structuring-Connecting, and another that joined Modelling and Assessment. From the perspective of the DMEE, the most important conclusion could be that the two aspects that theoretically belong to Structuring (Connecting and Clarity) seem to be conceptually different and closer to other aspects of teaching practice than to each other. To conclude, although certain aspects of teaching appear more likely to go together in the classroom, the relations between different theoretical aspects of teaching seem to depend, at least to some extent, on the subject that is taught. The fit of the scales to the theoretical model, as well as their discriminant validity, would likely be improved by selecting items that are highly specific to each teaching factor. However, this might narrow the meaning of the teaching factors and limit their usefulness in predicting student achievement.
While our analyses did not show sufficient discriminant validity of the scales, this is not a reason to abandon either the Dynamic Model of Educational Effectiveness or the instrument. The structure of the theoretical model has been validated across and within several countries using CFA and Structural Equation Modeling (SEM) (Belgium/Flanders, Cyprus, Germany, Greece, Ireland, and Slovenia; Kyriakides et al., 2014), albeit with a smaller number of items than in our study (28 vs. 59, as different items had to be removed from the original 49-item questionnaire in that study) and with two identified second-order factors (quality of teaching practices and quantity of teaching practices). More than 20 studies conducted in different countries have provided empirical support for the model (for a review of these studies, see Kyriakides et al., 2021). The DMEE has even been used in teachers' professional development, after which student achievement improved (Antoniou et al., 2011). Finally, although the confirmatory analyses in Study 2 did not show that the data satisfactorily fit the theoretical structure from the DMEE, that structure was still somewhat better than the structure proposed by Klieme's model or the model proposing that the whole variance in students' estimates of teaching may be due to the halo effect. Nevertheless, the results of our study indicate that these new scales, the theoretical constructs from the DMEE, or both could be further refined. While our instrument showed greater overlap of teaching practices than is desirable, the scales still reliably and validly measure important aspects of teaching quality based on one of the more widely utilized theoretical and empirical models of educational effectiveness in the world, the DMEE. Ideally, data from student questionnaires should be supplemented with classroom observation ratings in order to improve the reliability and validity of the model (Kyriakides et al., 2014).
The results of the multi-group CFA indicate that, when applied to different subjects, the scales measure the same constructs in a structurally identical manner. However, it is not justified to compare teachers of different subjects, because the estimates may be affected by factors that are not the subject of measurement (e.g., various types of biases; Fisher & Karl, 2019; Xu & Tracey, 2017). Fisher and Karl (2019) claim that not reaching scalar invariance when a scale is applied to different groups need not be a problem as long as researchers keep this limitation in mind. On the other hand, the fact that all eight scales showed identical structures for mathematics and biology suggests that it is also appropriate to use these scales for other subjects. The very fact that the items describe general rather than subject-specific aspects of teaching quality is in line with this conclusion.
Testing the relations between the teaching factors reported by students and the psychological variables reported by teachers showed that, in mathematics, the majority of teaching factors were associated with teachers' job satisfaction and their self-efficacy in engaging students, which is consistent with previous literature (Gkolia et al., 2014; Klusmann et al., 2008; Kunter et al., 2013). Additionally, external control, i.e., the belief that the teacher does not have a major impact on student achievement, was partly related to lower teaching quality, particularly to poor time management, a weaker ability to create a supportive learning environment, and inadequate questioning skills. It is interesting to note that these are the aspects of teaching that require good interpersonal and communication skills, which is consistent with the results of other research in which teachers' psychological variables were more related to classroom management and supportive relationships with students than to instructional abilities related to the cognitive activation of students (Klusmann et al., 2008; Kunter et al., 2013).
In contrast to mathematics, job satisfaction, external control, and self-efficacy had less "spillover" onto teaching practices in biology. Only self-efficacy in engaging students was associated with a better ability to manage time. These results suggest that it may be more demanding to bring students closer to the material and to an understanding of mathematics than of biology, so teachers' satisfaction with their own profession and their belief that they do their job well matter more in teaching mathematics than biology. Similarly, teachers who are prone to shifting responsibility for their students' achievement away from themselves have an easy excuse to put less effort into advancing their own practice. These findings resonate with research that identifies mathematics as a subject that requires and engages logical reasoning more than other subjects (e.g., Gómez-Veiga et al., 2018).
When applied to biology, the teaching factors measured by the scales developed in this study did not correlate with the variables chosen for external validation. However, it should be stressed that they did successfully predict students' interest in biology (Teodorović et al., 2021). Thus, the lack of statistically significant correlations between these variables is probably due to subject specificities rather than to inadequate external validity. It should also be noted that the scores on the teaching factors and the teacher psychological variables came from different sources (students vs. teachers), which is most likely one of the reasons why the correlations did not reach higher levels. Overall, we can conclude that the relations of teachers' job satisfaction, self-efficacy, and external control with the scales for measuring teaching factors from the DMEE, which were statistically significant and in the expected direction for mathematics, indicate adequate external validity of these scales.

Conclusion
This paper presented the validation of scales for measuring teaching factors from the Dynamic Model of Educational Effectiveness. The scales have largely proven to be reliable and valid measures of teaching quality as described in the model and can be used to assess teaching in school self-evaluation, external evaluation, or educational research, although they should not be used to compare the teaching of different subjects. Although the scales in these two studies were used for the assessment of mathematics and biology teaching, the aim of their construction was to apply them to other subjects as well. Additional validation of the scales on a subject from the social sciences or humanities is recommended.