Human Reproduction Update Advance Access originally published online on August 4, 2006
Human Reproduction Update 2006 12(6):685-718; doi:10.1093/humupd/dml034
© The Author 2006. Published by Oxford University Press on behalf of the European Society of Human Reproduction and Embryology. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

A systematic review of tests predicting ovarian reserve and IVF outcome

F.J. Broekmans1, J. Kwee2, D.J. Hendriks1, B.W. Mol3 and C.B. Lambalk2,4

1 Department of Reproductive Medicine, University Medical Centre Utrecht, Utrecht, 2 Division of Reproductive Endocrinology and Fertility and the IVF Centre, Department of Obstetrics and Gynaecology, Vrije Universiteit Medical Centre and 3 Centre for Reproductive Medicine, Department of Obstetrics and Gynecology, Academic Medical Centre, Amsterdam, The Netherlands

4 To whom correspondence should be addressed at: Department of Reproductive Medicine, Vrije Universiteit Medical Center (VUmc), PO Box 70057, 1007 MB, Amsterdam, The Netherlands. E-mail: cb.lambalk@vumc.nl


    Abstract
The age-related decline of the success in IVF is largely attributable to a progressive decline of ovarian oocyte quality and quantity. Over the past two decades, a number of so-called ovarian reserve tests (ORTs) have been designed to determine oocyte reserve and quality and have been evaluated for their ability to predict the outcome of IVF in terms of oocyte yield and occurrence of pregnancy. Many of these tests have become part of the routine diagnostic procedure for infertility patients who undergo assisted reproductive techniques. The unifying goals are traditionally to find out how a patient will respond to stimulation and what are their chances of pregnancy. Evidence-based medicine has progressively developed as the standard approach for many diagnostic procedures and treatment options in the field of reproductive medicine. We here provide the first comprehensive systematic literature review, including an a priori protocolized information retrieval on all currently available and applied tests, namely early-follicular-phase blood values of FSH, estradiol, inhibin B and anti-Müllerian hormone (AMH), the antral follicle count (AFC), the ovarian volume (OVVOL) and the ovarian blood flow, and furthermore the Clomiphene Citrate Challenge Test (CCCT), the exogenous FSH ORT (EFORT) and the gonadotrophin agonist stimulation test (GAST), all as measures to predict ovarian response and chance of pregnancy. We provide, where possible, an integrated receiver operating characteristic (ROC) analysis and curve of all individual evaluated published papers of each test, as well as a formal judgement upon the clinical value. Our analysis shows that the ORTs known to date have only modest-to-poor predictive properties and are therefore far from suitable for relevant clinical use. Accuracy of testing for the occurrence of poor ovarian response to hyperstimulation appears to be modest. 
Whether the a priori identification of actual poor responders in the first IVF cycle has any prognostic value for their chances of conception in the course of a series of IVF cycles remains to be established. The accuracy of predicting the occurrence of pregnancy is very limited. If a high threshold is used, to prevent couples from wrongly being refused IVF, a very small minority of IVF-indicated cases (~3%) are identified as having unfavourable prospects in an IVF treatment cycle. Although mostly inexpensive and not very demanding, the use of any ORT for outcome prediction cannot be supported. As poor ovarian response will provide some information on OR status, especially if the stimulation is maximal, entering the first cycle of IVF without any prior testing seems to be the preferable strategy.

Key words: IVF/ICSI outcome / ovarian reserve / ovarian stimulation


    Introduction
In Western societies the introduction in the 1960s of reliable methods of contraception has led to the birth of fewer children per family. Driven by increasing levels of female education, a growing participation in the labour force and career demands, postponement of childbearing has been a secondary consequence of the so-called sexual revolution (Leridon, 1998). These societal changes in family planning have caused a significant increase in the incidence of unwanted infertility due to female reproductive ageing (Weinstein et al., 1993; Abma et al., 1997; Ventura et al., 2001).

From studies on natural populations in which no consistent methods of birth control are applied, it has been shown that natural fertility starts to decline after the age of 30, declines faster from the mid-30s and leads to sterility at a mean age of 41 (Spira, 1988; Wood, 1989; te Velde and Pearson, 2002) (Figure 1). The reduction in female fertility can also be shown from contemporary population studies. The chance of not conceiving a first child within one year increases from under 5% in women in their early 20s to approximately 30% or over in the age group of 35 years and older (Abma et al., 1997). So, although the majority of women of older age will obtain the desired pregnancy within a one-year period, the chance of becoming subfertile increases approximately sixfold in comparison with very young women.


Figure 1. Quantitative (solid line) and qualitative (dotted line) decline of the ovarian follicle pool, which is assumed to dictate the onset of the important reproductive events [reproduced and adapted with permission from de Bruin and te Velde (2004)].

 
The age-related effect on female fertility has also been shown in numerous reports on the results of IVF treatment in infertile couples. The probability of live birth obtained through IVF treatment clearly decreases after the age of 35 (Anonymous, 1995; Templeton et al., 1996) and the same has been shown to be true for the implantation rate per embryo (van Kooij et al., 1996). In fact, female age has consistently been shown to be an important predictor of success in IVF treatment.

Over the past two decades, a number of so-called ovarian tests have been studied for their ability to predict the outcome of IVF in terms of oocyte yield and occurrence of pregnancy. Some of these tests have become part of the routine diagnostic procedure for infertility patients who will undergo assisted reproductive techniques. With the current work we aim to answer the question of what the true value of these tests is for patient management. Evidence-based medicine has progressively developed as the standard approach for many diagnostic procedures and treatment options in the field of reproductive medicine (National Collaborating Center for Women’s and Children’s Health, 2004). Therefore, we provide a comprehensive systematic literature review, including an a priori protocolized information retrieval on all currently available and applied tests to determine ovarian reserve (OR).

What follows is first a general section in which we briefly outline the aims and the valuation of OR testing and the set-up of the systematic review. After this, we describe individually all currently available tests and their effectiveness with regard to prediction of ovarian response and pregnancy after IVF in generally accepted terms for diagnostic procedures. A unique feature of this systematic review is that we will furthermore provide where possible an integrated receiver operating characteristic (ROC) analysis and curve of all individual evaluated published papers of each test, as well as a formal judgement upon the clinical value.


    The assessment of OR
OR can be considered normal in conditions where stimulation with the use of exogenous gonadotrophins will result in the development of at least 8–10 follicles and the retrieval of a corresponding number of healthy oocytes at follicle puncture (Fasouliotis et al., 2000). With such a yield, the chances of producing a live birth through IVF are considered optimal. In general, as outlined earlier, age of the woman is a simple way of obtaining information on the extent of her OR, in terms of both quantity and quality (Templeton et al., 1996). However, in view of the substantial variation in the decline of reproductive capacity with age (te Velde and Pearson, 2002) (Figure 2), there is a need to identify women of relatively young age with clearly diminished reserve, as well as women around the mean age at which natural fertility on average is lost (41 years) but still with adequate OR. In clinical terms, we aim to identify women with a high risk of producing a poor response to ovarian stimulation and/or a very low probability of becoming pregnant through IVF, as well as those who still produce enough oocytes to have a good chance of becoming pregnant even if female age is advanced. If it appears possible to identify such categories of women, then management could be individualized, for instance by stimulation dose or treatment scheme adjustments (Tarlatzis et al., 2003), by counselling against initiation of IVF treatment or pertinent refusal to accept initiation, or by indicating the necessity of early initiation of treatment before reserve has diminished too far.


Figure 2. Variations in age at the occurrence of specific stages of ovarian ageing. For explanation of the background of data, see te Velde and Pearson (2002). Reprinted with permission from te Velde and Pearson (2002).

 
OR is currently defined as the number and quality of the follicles left in the ovary at any given time. An accurate measure of the quantitative OR would involve the counting of all follicles present in both ovaries, as is done in post-mortem studies (Block, 1952). For obvious reasons, in OR testing, the true size of the follicle pool has not been used as the benchmark for evaluation (Lass et al., 1997a; Lambalk et al., 2004; Lass, 2004; Sharara and Scott, 2004), apart from one distinct study (Gulekli et al., 1999), where whole ovary counts served as reference for several OR tests (ORTs). Instead, several proxy variables of the pool size are used in studies on diagnostic accuracy, like ovarian response to hyperstimulation with exogenous FSH in IVF and the occurrence of menopause or menopausal transition, as these events are quantitatively determined. Although related, the quality of the oocyte released from the dominant follicle at ovulation represents the other aspect of ovarian reserve. Proxy variables for oocyte quality currently used are the pregnancy probability in infertility treatment like IUI and IVF or in the follow-up of couples during and after the initial infertility work-up.

We should therefore realize that in the vast majority of studies on ORTs that will be discussed below, either ovarian response or occurrence of pregnancy in IVF serves as the benchmark to judge the accuracy and clinical value of the test under study. Ovarian response to adequate stimulation may be considered the most accurate, though still indirect, representation of the status of the primordial follicle pool, as it is a condition that is continuously present in the individual who undergoes the test. In contrast, the occurrence of pregnancy in such an individual may be influenced by many more factors than oocyte, and hence embryo, quality alone. Only if the occurrence of pregnancy is studied in a series of treatment cycles may it represent a solid proxy variable for ovarian reserve. Most ORTs are quite adequate in predicting ovarian response, but often fail to correctly predict the occurrence of pregnancy, especially if only one IVF cycle was studied.

Properties of test evaluation

ORT evaluation using response and/or pregnancy as reference or outcome variables should imply the assessment of predictive accuracy and clinical value of the test. Accuracy refers to the degree by which the outcome condition is predicted correctly. Summary statistics of accuracy include sensitivity (rate of correct identification of cases with poor response), specificity (rate of correct identification of cases without poor response), likelihood ratio (LR, how many times more likely particular test results are in patients with poor response than in those without poor response) and diagnostic odds ratios (DOR, the odds of positive test results in cases with poor response over the odds of positive test results in those without poor response) (Deeks, 2001; Grimes and Schulz, 2005). To identify all cases that will respond poorly to stimulation without judging many normal responders badly, the test must have high sensitivity and high specificity.
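The summary statistics defined above all follow from a single 2 × 2 table. The sketch below is only an illustration: the class name `TwoByTwo` and the cell counts are hypothetical and are not taken from any study in this review, and no continuity correction is applied, so a table with an empty cell would raise a division error.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cross classification of test result against poor response (hypothetical helper)."""
    tp: int  # abnormal test, poor response
    fp: int  # abnormal test, normal response
    fn: int  # normal test, poor response
    tn: int  # normal test, normal response

    @property
    def sensitivity(self) -> float:
        # Share of poor responders correctly flagged by the test.
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # Share of normal responders correctly left unflagged.
        return self.tn / (self.tn + self.fp)

    @property
    def lr_positive(self) -> float:
        # How much more likely an abnormal result is in poor responders.
        return self.sensitivity / (1.0 - self.specificity)

    @property
    def lr_negative(self) -> float:
        # How much more likely a normal result is in poor responders.
        return (1.0 - self.sensitivity) / self.specificity

    @property
    def dor(self) -> float:
        # Odds of an abnormal test in poor responders over the same odds in normal responders.
        return (self.tp * self.tn) / (self.fp * self.fn)


# Illustrative counts only (assumption): 100 poor and 400 normal responders.
table = TwoByTwo(tp=30, fp=20, fn=70, tn=380)
print(f"sensitivity={table.sensitivity:.2f} specificity={table.specificity:.2f}")
print(f"LR+={table.lr_positive:.2f} LR-={table.lr_negative:.2f} DOR={table.dor:.2f}")
```

With these made-up counts the test is highly specific but misses most poor responders, which is the kind of trade-off the remainder of this section weighs.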

Positive LRs above 10 and negative LRs below 0.1 are considered indicators of an adequate diagnostic test, while positive LRs between 5 and 10 and negative LRs between 0.1 and 0.2 are considered to indicate a moderate test. The LR can be considered a clinically useful tool to help judge the performance of the test, because its value changes when the threshold for an abnormal test is shifted.

The diagnostic odds ratio is an adequate measure when combining studies in a systematic review, as a single diagnostic odds ratio corresponds to a set of sensitivities and specificities depicted by an ROC curve and is considered threshold independent (Figure 3). It therefore can be considered a good parameter to compare the overall accuracy of a test evaluated in different studies. Although the DOR values will be higher for tests with better combinations for sensitivity and specificity, this value has not been advocated as a single measure of clinical value, as changes in the threshold used will not be expressed by a change in DOR value. For the meta-analytic approach, the range of DOR values across studies gives some indication as to the homogeneity of such studies.
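The statement that one DOR corresponds to a whole set of sensitivity–specificity pairs can be illustrated numerically. The sketch below assumes a symmetric ROC curve along which the DOR is exactly constant; the function name `sensitivity_at` and the DOR value of 9 are hypothetical choices for illustration.

```python
def sensitivity_at(specificity: float, dor: float) -> float:
    """Sensitivity implied by a given specificity when the DOR is held constant.

    From DOR = [sens / (1 - sens)] / [(1 - spec) / spec] it follows that
    sens = DOR * fpr / (1 - fpr + DOR * fpr), with fpr = 1 - spec.
    """
    fpr = 1.0 - specificity
    return dor * fpr / (1.0 - fpr + dor * fpr)


# Every pair printed here shares the same DOR (assumed value: 9),
# although each pair corresponds to a different test threshold.
for spec in (0.95, 0.90, 0.75, 0.50):
    sens = sensitivity_at(spec, dor=9.0)
    print(f"specificity={spec:.2f} -> sensitivity={sens:.2f}")
```

A DOR of 1 reproduces the diagonal of the ROC space (sensitivity equal to 1 minus specificity), which is the no-discrimination case.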


Figure 3. Receiver operating characteristic (ROC) curve depicting the continuous relationship between sensitivity and specificity with shifting threshold values for a given test. The area under the ROC curve (AUC) provides general information on the discriminatory capacity of the test.

 
Finally, the area under the ROC curve provides information on the overall discriminatory capacity of the test. A value of 1.0 implies perfect discrimination and a value of 0.5 indicates no discrimination at all.

Clinical value incorporates the question whether application of the test at a certain threshold will really change management or costs or safety or success rates on a population basis. It deals with the valuation of false positive and false negative test results in relation to the consequences of these test results for clinical decisions. Also it implies the rate of abnormal test results leading to altered decisions within the population of interest.

Design of ORT studies

Studies on the predictive accuracy and clinical value of ORTs should preferably be prospective in design and should examine cohorts of patients in IVF settings without exclusion of cases with signs of diminished ovarian reserve. Patient management should not have been influenced by the test under study (verification bias). Also, evaluation should be equally weighted for every case; thus, every case should contribute the same number of cycles to the analysis. In most studies, only one IVF cycle is studied. A case–control design for the purpose of OR testing bears the disadvantage of retrospection and the absence of a reliable estimate of disease prevalence. The tests under study should in principle be reproducible, both at the laboratory (hormone assays) and at the operator level (ultrasound examination). Also, the outcome of treatment (response and pregnancy), serving as the reference for ovarian reserve, should be clearly defined.

The accuracy in predicting a certain outcome by the test under study should be evaluated by constructing contingency tables at several threshold levels for an abnormal test. Using the calculated sensitivity and specificity from each threshold level, a ROC curve (Figure 3) can be drawn and the calculated area under this curve represents the overall predictive accuracy of the test. Assessment of the clinical value is a complex process in which the applicability in daily practice should become clear. The overall accuracy represented by the ROC curve, the choice of a threshold for abnormality, the rate of abnormal tests at that threshold, the post-test probability of disease (i.e. poor response or non-pregnancy), the valuation of false positive and false negative test results and the consequence for patient management of an abnormal test will all contribute to the process of deciding whether a test is useful or not. Finally, the cost of carrying out the test as a routine measure and the burden to the patient balanced against the reduction in costs by excluding cases with low pregnancy prospects should contribute to the decision whether or not to apply a test.
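The step from contingency tables at several thresholds to an ROC curve and its area can be sketched as follows. All data are invented for illustration, the function names are hypothetical, and the test is assumed to be abnormal when the measured value is at or above the threshold (as would be the case for a marker that rises with diminishing reserve). The area is obtained with the trapezoidal rule, which is one common choice and not necessarily the one used in the reviewed studies.

```python
def roc_points(values, poor, thresholds):
    """Return sorted (1 - specificity, sensitivity) points, one per threshold.

    The corner points (0, 0) and (1, 1) are added so the curve spans the ROC space.
    """
    n_poor = sum(poor)
    n_normal = len(poor) - n_poor
    points = {(0.0, 0.0), (1.0, 1.0)}
    for t in thresholds:
        # Cells of the 2 x 2 table at this threshold.
        tp = sum(1 for v, p in zip(values, poor) if p and v >= t)
        fp = sum(1 for v, p in zip(values, poor) if not p and v >= t)
        points.add((fp / n_normal, tp / n_poor))
    return sorted(points)


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the sorted ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical test values: first four cases are poor responders.
values = [14, 12, 9, 16, 6, 7, 8, 10, 5, 11]
poor = [True, True, True, True, False, False, False, False, False, False]

points = roc_points(values, poor, thresholds=[8, 10, 12, 15])
auc = auc_trapezoid(points)
print(points)
print(f"AUC = {auc:.3f}")
```

Using only a handful of thresholds, as here, gives a coarse curve; evaluating every observed value as a threshold would trace the full empirical ROC curve.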

ORTs in relation to other predictors of success

It is important for patients who are considering treatment with IVF to know the probability of success in the course of a series of IVF treatment cycles. The possibility of a live birth for any couple undergoing treatment will depend on the success rate at the individual clinic. However, equally important in the prediction of outcome are the characteristics of the couple seeking treatment (Stolwijk et al., 1996; Templeton et al., 1996; Sharma et al., 2002). Serious effort has been put into the build-up of prediction models that estimate the probabilities for success prior and during subsequent IVF cycles. In general, these models appeared inaccurate when external validation studies were carried out (Stolwijk et al., 1998; Smeenk et al., 2000). Intuitively, many IVF centres will use factors like female age, parity, duration of infertility, ovarian response in the first IVF attempt and embryo quality for individual counselling, albeit not through a formal prediction model. Within this practice, ORTs also may play a certain role and female age will be the one ORT applied almost without exception. The pressing question would be to what extent other, endocrine- or ultrasound-based, ORTs contribute and add to the prognostic information already obtained from the infertility work-up or the first IVF cycle. To date, studies specifically addressing this question are scarce or do not include the full range of prognostic factors available.

There are a number of studies (Eimers et al., 1994; Collins et al., 1995; Snick et al., 1997; Hunault et al., 2004; Hunault et al., 2005) that offer a model, based on factors like duration of subfertility, female age, parity, sperm quality and post-coital test, for the prediction of live birth among untreated subfertile couples. However, none of these models included ORTs, apart from female age. Only one study examined this, and it showed that ORTs failed to add relevant information on the couple’s chances of a spontaneous pregnancy beyond the predictions based on the Eimers model (van Rooij et al., 2005).

General remarks on physiological background of ORTs

Tests that are used to predict some defined outcome related to ovarian reserve almost without exception give assessment of the number of follicles remaining at some time point in both ovaries. Any marker giving an estimate of the remaining pool will at the same time be capable of providing, to some extent, information on oocyte quality. But on average, from prediction studies it seems that some markers give a better indication of quality than others. Female age, for instance, is the basic factor that is related to both quantity and quality. Basal FSH, through the feedback of inhibin B and estradiol, will represent cohort size but mostly at the extremes and therefore give a more thorough indication of quality aspects. This is in contrast to the more direct quantitative tests using antral follicle count (AFC), anti-Müllerian hormone (AMH) and ovarian volume (OVVOL) that are capable of describing a more complete range of ovarian reserve states. By choosing the right thresholds these tests may eventually correctly predict oocyte quality. The true relation between quantity and quality, however, remains a source of debate. Quantity is an aspect of ovarian reserve that is present in a continuous state and therefore offers a more or less continuous measurability. Quality, however, comes to expression every now and then, even in the setting of IVF. The relationship between the two aspects of ovarian reserve has become more evident when the predictive value of a poor response in a first IVF cycle was examined towards the probability of pregnancy in the actual or subsequent cycles (Klinkert et al., 2004). While cases with a normal response in additional cycles yielded acceptable rates of pregnancy, it was shown that in repeated poor responders this probability never surpassed 10% (de Boer et al., 2002; Lawson et al., 2003; Klinkert et al., 2004). 
It is also important to remember that there are several factors that contribute to the occurrence of pregnancy other than ovarian reserve, such as embryo transfer technique and number of embryos replaced. Even in young women with normal reserve the chance of non-pregnancy remains at least at the 50% level. So, a non-pregnancy state after IVF may even be attributed to unknown, yet non-ovarian reserve related, factors.

Approach of the systematic review

The aim of the systematic review on the value of diagnostic tests is to obtain an overall estimate of the test accuracy and clinical value based on all present evidence, after assessing the quality of the included studies and evaluating the variation in findings among the studies (Irwig et al., 1995; Deeks, 2001; Deville et al., 2002; Honest and Khan, 2002; Glas et al., 2003). Systematic review and meta-analysis on diagnostic accuracy and value implies consecutive steps as summarized in Table I (Irwig et al., 1994; Mol et al., 1997); please see the addendum.


Table I. Stepwise approach to the systematic review and meta-analysis of diagnostic tests

 
For each study finally included in the meta-analysis, sensitivity and specificity are calculated from the contingency tables. Homogeneity of the sensitivity–specificity points is tested by means of the χ²-test statistic. A summary point estimate of sensitivity and specificity and the 95% confidence interval is calculated if homogeneity cannot be rejected. In case of heterogeneity, logistic regression is used to evaluate whether Quality/Methodology characteristics of a study are associated with the discriminative capacity of the test under study. If one of the study characteristics is found to have a statistically significant impact on the performance of the test, further analysis is performed in subgroups of patients. If not, it is explored whether the differences in sensitivity–specificity combinations are because of the use of different threshold levels of the test under study. For this purpose, a Spearman correlation coefficient is calculated to assess the association between sensitivity and specificity. If there is a negative correlation as defined by a correlation coefficient of –0.5 or stronger, the individual pairs of sensitivity and specificity are considered to originate from a single ROC curve. All sensitivity–specificity points are then plotted and a summary ROC curve is estimated using a random-effects regression model (Littenberg and Moses, 1993; Midgette et al., 1993; Moses et al., 1993).
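The last two steps of this procedure (the Spearman check and the summary ROC curve) can be sketched in a few lines. This is a simplified illustration under stated assumptions: the study-level sensitivities and specificities are invented, the function names are hypothetical, and the curve is fitted by plain unweighted least squares on the logit scale (regressing D = logit(TPR) − logit(FPR) on S = logit(TPR) + logit(FPR)), not by the random-effects model named in the text. No continuity correction is applied, so sensitivities or specificities of exactly 0 or 1 would fail.

```python
import math


def _ranks(xs):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman coefficient: Pearson correlation computed on the ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def logit(p):
    return math.log(p / (1.0 - p))


def fit_sroc(sens, spec):
    """Unweighted least-squares fit of D = a + b * S (simplifying assumption)."""
    d = [logit(se) - logit(1.0 - sp) for se, sp in zip(sens, spec)]  # log DOR
    s = [logit(se) + logit(1.0 - sp) for se, sp in zip(sens, spec)]  # threshold proxy
    n = len(d)
    ms, md = sum(s) / n, sum(d) / n
    b = sum((si - ms) * (di - md) for si, di in zip(s, d)) / sum((si - ms) ** 2 for si in s)
    a = md - b * ms
    return a, b


def sroc_sensitivity(fpr, a, b):
    """Sensitivity on the fitted curve at a given false-positive rate."""
    z = a / (1.0 - b) + (1.0 + b) / (1.0 - b) * logit(fpr)
    return 1.0 / (1.0 + math.exp(-z))


# Invented study-level results, one pair per study.
sens = [0.10, 0.20, 0.40, 0.60]
spec = [0.99, 0.95, 0.85, 0.70]

rho = spearman(sens, spec)
if rho <= -0.5:  # criterion stated in the text
    a, b = fit_sroc(sens, spec)
    print(f"rho={rho:.2f}, a={a:.2f}, b={b:.2f}")
    print(f"sensitivity at 90% specificity: {sroc_sensitivity(0.10, a, b):.2f}")
```

The intercept `a` is the log DOR at the point where sensitivity equals specificity, and a slope `b` near zero indicates that the DOR hardly changes with the threshold.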

An important issue is the fact that individual studies may produce highly variable sensitivity–specificity points in the ROC space. This is generally explained by variation in the applied threshold level for an abnormal test across the studies or by the presence of considerable study heterogeneity. As heterogeneity in design is dealt with in the formal analysis, the remaining variation in sensitivity–specificity points is generally attributed to variation in threshold levels, which allows us to construct a summary ROC curve. At the same time, the threshold variation makes it impossible to establish a single threshold for a specific test that has generalizable value. This would only become possible if the original database of every study were available, which to date seems to require an extreme effort.

To assess the clinical value of the test under study for the assessment of disease state (i.e. poor response or non-pregnancy), the positive and negative predictive values are calculated using the estimated summary ROC curve and assuming arbitrary prevalences of the disease in the population. An LR for a positive (or abnormal) test result is then calculated for each point on the estimated ROC curve. Subsequently, the post-test probabilities of disease at various LR values are calculated for the arbitrary pre-test probabilities of disease, assuming independence between the pre-test probability and the performance of the test (Bancsi et al., 2003). Final judgement depends on the overall accuracy, the choice of the test threshold, the post-test prediction at that threshold level and the valuation of a false positive test result. In case no estimated curve from the selected studies can be constructed, the judgement upon the clinical value is based on a comparison of a preset level of sensitivity and specificity with the observed levels in the various studies.
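The shift from pre-test to post-test probability described here is an application of the odds form of Bayes' theorem: post-test odds equal pre-test odds multiplied by the LR. The sketch below uses an assumed pre-test probability of poor response of 20% and an assumed positive LR of 6; both numbers and the function names are illustrative only.

```python
def post_test_probability(pre_test: float, lr: float) -> float:
    """Convert a pre-test probability to a post-test probability via the LR."""
    pre_odds = pre_test / (1.0 - pre_test)
    post_odds = pre_odds * lr
    return post_odds / (1.0 + post_odds)


def positive_predictive_value(sens: float, spec: float, prevalence: float) -> float:
    """Probability of disease given an abnormal test, at an assumed prevalence."""
    true_pos = sens * prevalence
    false_pos = (1.0 - spec) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Assumed inputs: prevalence of poor response 20%, sensitivity 30%, specificity 95%.
lr_pos = 0.30 / (1.0 - 0.95)
print(f"post-test probability: {post_test_probability(0.20, lr_pos):.2f}")
print(f"PPV at the same prevalence: {positive_predictive_value(0.30, 0.95, 0.20):.2f}")
```

Both routes give the same answer, because the positive predictive value is simply the post-test probability after an abnormal result at that prevalence.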

Systematic reviewing of ORTs

The aim of the present series of systematic reviews is to assess the true diagnostic accuracy and clinical value of the ORTs known to date, when applied in an IVF/ICSI population. Reference standards used to valuate the test properties are response to ovarian stimulation and occurrence of pregnancy. No preset definition was used for these standards. For every ORT under study, a computerized MEDLINE search was performed to identify articles on the subject outlined in the previous chapters published until December 2004. Checking of reference lists of articles already obtained was done, all in an iterative fashion. Keywords used for the various searches were ‘in vitro fertilization’ or ‘in vitro fertilisation’ or ‘assisted’ or ‘intracytoplasmatic’ or ‘intracytoplasmic’, in combination with ‘test-specific’ keywords, as mentioned in the tables.

One investigator (DH or JK) read all abstracts of the articles that were identified by the search. Any article reporting on the association of the test with poor ovarian response and/or non-pregnancy after IVF or possibly containing information that was to be transformed into a predictive tabulation was pre-selected. Subsequently, all pre-selected articles were fully read and judged independently by two investigators (DH and JK), and separate 2 × 2 tables were constructed for cross classification of the test result and the occurrence of poor response and/or non-pregnancy, whenever possible. In the event of disagreement on the inclusion or exclusion of pre-selected studies for the meta-analysis or on the calculation of the 2 × 2 table data or the scoring of quality characteristics, the judgement of a third author (FB or CL) was decisive. Studies in which it was not possible to construct 2 × 2 tables were excluded. Cross-references in all selected articles were checked, and, if applicable, studies were added to the analysis.

Each study was scored by the investigators on the following Quality/Methodology characteristics: (i) sampling (consecutive versus other), (ii) data collection (prospective versus retrospective), (iii) study design (cohort study versus case–control study), (iv) blinding (present or absent), (v) selection bias, (vi) verification bias, (vii) analysis on one or multiple cycles per couple and (viii) definition of outcome, poor response and pregnancy.

In the following sections, the results of search, data extraction, quality and methodology assessment and meta-analysis of extracted data as outlined above are discussed for every ORT comprised in this review.

Basal FSH

Systematic review
Through the search and selection strategy, a total of 37 studies reporting on the capacity of basal FSH to predict poor ovarian response and/or non-pregnancy after IVF and which were suitable for data extraction and meta-analysis were identified (Scott et al., 1989; Padilla et al., 1990; Toner et al., 1991; Khalifa et al., 1992; Chan et al., 1993; Ebrahim et al., 1993; Fanchin et al., 1994; Huyser et al., 1995; Licciardi et al., 1995; Smotrich et al., 1995; Balasch et al., 1996; Csemiczky et al., 1996; Martin et al., 1996; Pruksananonda et al., 1996; Gurgan et al., 1997; Chang et al., 1998a; Evers et al., 1998; Ranieri et al., 1998; Sharif et al., 1998; Bassil et al., 1999; Hall et al., 1999; Bancsi et al., 2000; Chae et al., 2000; Creus et al., 2000; Fabregues et al., 2000; Jinno et al., 2000; Penarrubia et al., 2000; Mikkelsen et al., 2001; Nahum et al., 2001; van der Stege and van der Linden, 2001; Esposito et al., 2002; Chuang et al., 2003; Fiçicioglu et al., 2003; Kwee et al., 2003; Yanushpolsky et al., 2003; Akande et al., 2004; Erdem et al., 2004). Characteristics of the included studies are listed in Table II. As shown, there was a large diversity with regard to the various aspects of methodology and quality, and the definition of poor ovarian response. Logistic regression analysis indicated no significant association between any of these study characteristics and the predictive performance of basal FSH. For example, whether the design of the study was retrospective or prospective did not influence the prognostic capacity of basal FSH.


Table II. Characteristics of included studies on Basal FSH (computerized search using the test-specific keywords follicle stimulating hormone and FSH)

 
Accuracy of poor response prediction
The sensitivities and specificities, as well as the positive LRs of an abnormal test and the DORs for the prediction of poor ovarian response, as calculated from each study, are summarized in Table III (please see the addendum). Sensitivity and specificity points, as plotted in Figure 4, were heterogeneous between studies (χ²-test statistic: P-value for sensitivity 0.001 and P-value for specificity 0.001). Therefore, calculation of one summary point estimate for sensitivity and specificity was not meaningful for overall judgement of accuracy. The Spearman correlation coefficient for sensitivity and specificity was –0.87, which was judged to be sufficient to estimate a summary ROC curve (Figure 4).
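A homogeneity check of this kind can be sketched as a Pearson chi-square statistic on a k × 2 table (one row per study: poor responders with an abnormal test versus those with a normal test). The text does not spell out the exact form of the test that was used, so this is an assumption, as are the function name and the invented counts. The sketch returns only the statistic and its degrees of freedom; obtaining a P-value would require a chi-square distribution function from a statistics library.

```python
def chi2_homogeneity(events, totals):
    """Pearson chi-square statistic for equality of k proportions.

    events[i] is the number of correctly identified cases in study i
    (for sensitivity: true positives), totals[i] the number of cases.
    Returns (statistic, degrees_of_freedom).
    """
    pooled = sum(events) / sum(totals)
    stat = 0.0
    for e, n in zip(events, totals):
        expected_hit = n * pooled
        expected_miss = n * (1.0 - pooled)
        stat += (e - expected_hit) ** 2 / expected_hit
        stat += ((n - e) - expected_miss) ** 2 / expected_miss
    return stat, len(events) - 1


# Two invented studies with clearly different sensitivities (20% versus 60%).
stat, df = chi2_homogeneity(events=[10, 30], totals=[50, 50])
print(f"chi-square = {stat:.2f} on {df} degree(s) of freedom")
```

A large statistic relative to its degrees of freedom argues against pooling the studies into one summary sensitivity, which is the situation reported for basal FSH.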


Table III. Performance of basal FSH in the prediction of poor response in IVF patients and shift from pre-test to post-test probability of poor response for patients with an abnormal (= higher than the threshold) FSH result

 


Figure 4. Estimated ROC curve and sensitivity–specificity points for all studies reporting on the performance of basal FSH in the prediction of poor response. Studies reporting on several threshold points are represented by an equivalent number of sens–spec points. N in the legend refers to the number of cycles studied, which in some studies is equivalent to the number of couples treated.

 
Accuracy of non-pregnancy prediction
Sensitivities and specificities for the prediction of non-pregnancy, as calculated from each study, are summarized in Table IV (please see addendum). Again, the sensitivity and specificity points plotted in Figure 5 were heterogeneous between studies (χ²-test: P = 0.001 for sensitivity and P = 0.001 for specificity). The Spearman correlation coefficient between sensitivity and specificity was –0.82, which was sufficient to estimate a summary ROC curve (Figure 5).
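A strongly negative rank correlation between sensitivity and specificity across studies suggests that the studies differ mainly in the threshold applied, which is the situation in which a summary ROC curve is a reasonable summary. The following Python sketch shows how such a Spearman coefficient can be computed; the helper names and the five sensitivity–specificity pairs are hypothetical and do not reproduce the data of the included studies.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical sensitivity-specificity points, one per study threshold.
sensitivities = [0.10, 0.25, 0.40, 0.60, 0.80]
specificities = [0.99, 0.95, 0.96, 0.85, 0.70]

rho = spearman(sensitivities, specificities)
print(f"Spearman correlation = {rho:.2f}")
```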


Table IV. Performance of basal FSH in the prediction of non-pregnancy in IVF patients and shift from pre-test to post-test probability of pregnancy for patients with an abnormal (= higher than the threshold) FSH result

 


Figure 5. Estimated ROC curve and sensitivity–specificity points for all studies reporting on the performance of basal FSH in the prediction of non-pregnancy. Studies reporting on several threshold points are represented by an equivalent number of sens–spec points. N in the legend refers to the number of cycles studied, which in some studies is equivalent to the number of couples treated.

 
Clinical value
Based on the summary ROC curves depicted in Figures 4 and 5, a range of positive LRs was calculated, and for each ratio the pre-FSH-test probability of poor response or non-pregnancy was converted into a post-FSH-test probability. Table V (please see addendum) shows the probability of obtaining a certain FSH test result and the corresponding LR within different LR ranges for the prediction of poor response and non-pregnancy. At a maximum positive LR of 8, the post-FSH-test probability of poor response approximates 70% if the pre-FSH-test probability is assumed to be 20%. As is apparent from this table, the probability of obtaining a test result (FSH level) with an LR of ~8 is quite small. Table III shows that in women with an increased FSH level the probability of poor response increases substantially (3-fold or more) only in studies applying a high threshold level for FSH, which results in a very limited number of patients with an abnormal test result.


Table V. The occurrence of the basal FSH results within a specified likelihood ratio (LR) range and the concomitant post-test probabilities of poor response and non-pregnancy, given a prevalence of poor response of 20% and non-pregnancy of 80%

 
This holds even more strongly for the prediction of non-pregnancy: the extremely high FSH levels needed to obtain a moderate positive LR of ~5, which lowers the pregnancy rate from 20% before the test to less than 5% after it, again occur in only a very limited number of patients (Table V). Beyond the coordinate defined by specificity 0.90 and sensitivity 0.20, the summary ROC curve runs almost parallel to the line of equality, which indicates that this segment of the curve is essentially uninformative (LR ~1).
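The post-test probabilities quoted in this section follow from the usual odds form of Bayes' theorem: pre-test odds multiplied by the LR give post-test odds. The short Python sketch below, in which the function name is an illustrative choice, reproduces the two headline figures from the prevalences assumed in Table V (20% poor response, 80% non-pregnancy).

```python
def post_test_probability(pre_test: float, lr: float) -> float:
    """Convert a pre-test probability into a post-test probability via odds."""
    pre_odds = pre_test / (1.0 - pre_test)
    post_odds = pre_odds * lr
    return post_odds / (1.0 + post_odds)


# Poor response: pre-test probability 20%, positive LR 8.
poor_response = post_test_probability(0.20, 8.0)

# Non-pregnancy: pre-test probability 80% (pregnancy rate 20%), positive LR 5.
non_pregnancy = post_test_probability(0.80, 5.0)
pregnancy = 1.0 - non_pregnancy

# An LR of 1 leaves the probability unchanged (uninformative result).
unchanged = post_test_probability(0.20, 1.0)

print(f"post-test probability of poor response: {poor_response:.1%}")
print(f"post-test pregnancy rate:               {pregnancy:.1%}")
print(f"post-test probability at LR = 1:        {unchanged:.1%}")
```

An LR of 8 moves a 20% probability of poor response to about 67%, and an LR of 5 moves an 80% probability of non-pregnancy to about 95%, leaving a pregnancy rate just under 5%.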

All this leads to the conclusion that, in regularly cycling women, basal FSH predicts poor response and non-pregnancy with adequate accuracy only at very high threshold levels; because abnormal results are then very rare, the test has hardly any clinical value. Given also a false positive rate of ~5%, the test is not suitable as a diagnostic test to exclude patients, but only as a screening test for counselling purposes and for directing further diagnostic steps, of which a first IVF attempt may be the step of choice (Roberts et al., 2005).

AMH

Systematic review
The search and selection strategy identified two studies that reported on the predictive capacity of AMH and were suitable for data extraction and meta-analysis (van Rooij et al., 2002; Muttukrishna et al., 2004). Characteristics of the included studies are listed in Table VI (see addendum).


Table VI. Characteristics of included studies on AMH (computerized search using the test-specific keywords anti-mullerian hormone or mullerian inhibiting factor or mullerian inhibiting substance)

 
Accuracy of poor response prediction
The sensitivities and specificities, the positive LR and the DOR for the prediction of poor ovarian response, as calculated from each study, are summarized in Table VII (see addendum) and in Figure 6. Homogeneity could not be rejected for sensitivity and specificity (χ²-test: P = 0.12 for sensitivity and P = 0.64 for specificity), but this merely reflects the fact that only two studies were included. As can be seen from Figure 6, the points of the two studies can be thought of as originating from a single ROC curve (Spearman correlation coefficient between sensitivity and specificity –0.81). The summary ROC curve estimated from these points is also shown in Figure 6.


Table VII. Performance of AMH in the prediction of poor response in IVF patients and shift from pre-test to post-test probability of poor response for patients with an abnormal AMH result

 


Figure 6. Estimated ROC curve and sensitivity–specificity points for all studies reporting on the performance of AMH in the prediction of poor response. Studies reporting on several threshold points are represented by an equivalent number of sens–spec points. N in the legend refers to the number of cycles studied, which in some studies is equivalent to the number of couples treated. Reference lines indicate a desired level for sensitivity (0.75) and specificity (0.85).

 
Accuracy of non-pregnancy prediction
Sensitivities and specificities for the prediction of non-pregnancy by AMH are summarized in Table VIII. As the study of van Rooij et al. (2002) was the only one detected, further meta-analysis was not useful. The ROC curve derived from the data of van Rooij et al., representing the accuracy of AMH in the prediction of non-pregnancy, is shown in Figure 7.


Table VIII. Performance of AMH in the prediction of non-pregnancy in IVF patients and shift from pre-test to post-test probability of non-pregnancy for patients with an abnormal AMH result

 


Figure 7. Estimated ROC curve and sensitivity–specificity points for all studies reporting on the performance of AMH in the prediction of non-pregnancy. Studies reporting on several threshold points are represented by an equivalent number of sens–spec points. N in the legend refers to the number of cycles studied, which in some studies is equivalent to the number of couples treated. Reference lines indicate a desired level for sensitivity (0.75) and specificity (0.85).

 
Clinical value
As data from only two studies are available, it is not feasible to extract data on the interrelation between positive LRs, post-test probabilities and the rate of abnormal tests. However, judged against a desired level of 75% for sensitivity and 85% for specificity, AMH performs only moderately in the prediction of poor response, especially with regard to sensitivity. For non-pregnancy prediction, a desired level of 40% for sensitivity and 95% for specificity would imply that the test has hardly any value unless very low threshold levels are used, which will certainly lead to only very small percentages of abnormal tests. Additional studies must be awaited to learn whether AMH may prove superior to current tests such as basal FSH and the AFC (Hazout et al., 2004; Muttukrishna et al., 2005; Penarrubia et al., 2005).

Inhibin B

Systematic review
We detected nine studies that reported on the predictive capacity of inhibin B and were suitable for data extraction and meta-analysis (Balasch et al., 1996; Seifer et al., 1997; Hall et al., 1999; Creus et al., 2000; Fabregues et al., 2000; Penarrubia et al., 2000; Bancsi et al., 2002a; Fiçicioglu et al., 2003; Erdem et al., 2004). Characteristics of the included studies are listed in Table IX (see addendum). The definitions of poor response and the study quality and design characteristics clearly varied, but logistic regression analysis revealed that none of these items significantly affected the predictive performance of the test. Subgroup analysis was therefore not indicated.


Table IX. Characteristics of included studies on inhibin B (computerized search using the test-specific keyword inhibin B)

 
Accuracy of poor response prediction
The sensitivities and specificities, the positive LR and the DOR for the prediction of poor ovarian response, as calculated from each study, are summarized in Table X (see addendum). Calculating a single summary point estimate for sensitivity and specificity was not meaningful, as both test characteristics, as plotted in Figure 8, were heterogeneous among studies (χ²-test: P < 0.001 for sensitivity and P = 0.002 for specificity). The Spearman correlation coefficient between sensitivity and specificity was sufficient to estimate a summary ROC curve (R = –0.93, Figure 8). The figure clearly shows that all but one study lay close to the estimated ROC curve and that one study reported a clearly better accuracy (Fiçicioglu et al., 2003). This study was of good quality but reported on only a small number of patients.


Table X. Performance of inhibin B in the prediction of poor response in IVF patients and shift from pre-test to post-test probability of poor response for patients with an abnormal inhibin B result

 


Figure 8. Estimated ROC curve and sensitivity–specificity points for all studies reporting on the performance of inhibin B in the prediction of poor response. Studies reporting on several threshold points are represented by an equivalent number of sens–spec points. N in the legend refers to the number of cycles studied, which in some studies is equivalent to the number of couples treated.

 
Accuracy of non-pregnancy prediction
Three studies reported on the capacity of inhibin B to predict non-pregnancy. Sensitivities and specificities for the prediction of non-pregnancy, as calculated from each study, are summarized in Table XI. Sensitivity and specificity, as plotted in Figure 9, were heterogeneous between studies (χ²-test: P = 0.004 for sensitivity and P < 0.001 for specificity). The Spearman correlation coefficient between sensitivity and specificity was –0.94, which was sufficient to estimate a summary ROC curve.


Table XI. Performance of inhibin B in the prediction of non-pregnancy in IVF patients and shift from pre-test to post-test probability of non-pregnancy for patients with an abnormal inhibin B result

 


Figure 9. Estimated ROC curve and sensitivity–specificity points for all studies reporting on the performance of inhibin B in the prediction of non-pregnancy. Studies reporting on several threshold points are represented by an equivalent number of sens–spec points. N in the legend refers to the number of cycles studied, which in some studies is equivalent to the number of couples treated.

 
Clinical value
Based on the summary ROC curves depicted in Figures 8 and 9, a range of positive LRs was calculated, and for each ratio the pre-inhibin B-test probabilities of poor response or non-pregnancy (20 and 80%, respectively) were converted into post-inhibin B-test probabilities. Table XII shows the probability of obtaining a certain inhibin B test result and the corresponding LR within different LR ranges for the prediction of poor response and non-pregnancy. At a very modest LR of 4, the post-inhibin B-test probability of poor response will not be higher than 55%, while the chance of obtaining such a test result is very small.


Table XII. The occurrence of the inhibin B results within a specified likelihood ratio (LR) range and the concomitant post-test probabilities of poor response and non-pregnancy, given a prevalence of poor response of 20% and non-pregnancy of 80%

 
For prediction of non-pregnancy, extreme threshold levels are necessary to obtain a modest positive likelihood ratio of ~4–5, leading to a post-test pregnancy rate of approximately 5%. Such abnormal test results occur only in a very limited number of patients, while the false positive rate will lead to unnecessary exclusions from IVF programs if the test is used in a diagnostic fashion.

In regularly cycling women, basal inhibin B predicts poor response and non-pregnancy with only modest accuracy, and only at a very low threshold level. At best the test may be used as a screening test for counselling purposes or to direct further diagnostic steps, such as a first IVF attempt to observe the response to ovarian stimulation. Used in this way, the test may well be inferior to other tests discussed in this review.

Basal estradiol

Systematic review
We detected a total of 10 studies that reported on the predictive capacity of basal estradiol and were suitable for data extraction and meta-analysis (Licciardi et al., 1995; Smotrich et al., 1995; Evers et al., 1998; Vazquez et al., 1998; Hall et al., 1999; Frattarelli et al., 2000; Penarrubia et al., 2000; Phophong et al., 2000; Mikkelsen et al., 2001; Ranieri et al., 2001; Bancsi et al., 2002a). Characteristics of the included studies are listed in Table XIII (see addendum). Again, the definitions of poor response and the study quality and design characteristics clearly varied, but logistic regression analysis revealed that none of these items significantly affected the predictive performance of the test. Subgroup analysis was therefore not indicated.


Table XIII. Characteristics of included studies on basal estradiol (computerized search using the test-specific keyword estradiol)

 
Accuracy of poor response prediction
Eight studies reported on the prediction of poor response. The sensitivities and specificities, the positive LR and the DOR for the prediction of poor ovarian response, as calculated from each study, are summarized in Table XIV. Calculating a single summary point estimate for sensitivity and specificity was not meaningful, as both test characteristics, as plotted in Figure 10, were heterogeneous among studies (χ²-test: P < 0.001 for sensitivity and P = 0.002 for specificity). The Spearman correlation coefficient between sensitivity and specificity was –0.50. As can be seen from Figure 10, this may be due to three outliers, which were extracted from the studies of Smotrich et al. and Ranieri et al. Neither the clinical nor the methodological characteristics of these studies provided a clear explanation for the outliers. When the correlation between sensitivity and specificity was assessed after exclusion of the three outliers, we found a very strong correlation (–0.94). Figure 10 shows two estimates of a summary ROC curve, one constructed with all data and one constructed after exclusion of the two studies with outlying data.


Table XIV. Performance of basal estradiol in the prediction of poor response in IVF patients and shift from pre-test to post-test probability of poor response for patients with an abnormal estradiol result

 


Figure 10. Estimated ROC curve and sensitivity–specificity points for all studies reporting on the performance of basal estradiol in the prediction of poor response. Studies reporting on several threshold points are represented by an equivalent number of sens–spec points. N in the legend refers to the number of cycles studied, which in some studies is equivalent to the number of couples treated.

 
Accuracy of non-pregnancy prediction
Nine studies reported on the capacity of basal estradiol to predict non-pregnancy after IVF. Sensitivities and specificities for the prediction of non-pregnancy, as calculated from each study, are summarized in Table XV. Again, sensitivity and specificity, as plotted in Figure 11, were heterogeneous between studies (χ²-test: P < 0.001 for sensitivity and P < 0.001 for specificity). The Spearman correlation coefficient between sensitivity and specificity was –0.89, which was sufficient to estimate a summary ROC curve (Figure 11). This summary ROC curve is almost parallel to the line x = y, indicating virtually no discriminative capacity.
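Why a curve parallel to the line x = y carries no information can be made explicit: the LR of a test result lying between two thresholds equals the slope of the ROC segment joining them, that is, the change in sensitivity divided by the change in 1 – specificity. A slope of 1 therefore means an LR of 1. The Python sketch below illustrates this; the function name and the ROC points are hypothetical and were chosen only to show the two cases.

```python
def interval_lr(strict_point, lenient_point):
    """LR of a result lying between two thresholds on an ROC curve.

    Each point is a (sensitivity, specificity) pair. The lenient threshold
    labels more patients abnormal, so it has the higher sensitivity and
    the lower specificity.
    """
    sens_strict, spec_strict = strict_point
    sens_lenient, spec_lenient = lenient_point
    delta_tpr = sens_lenient - sens_strict
    delta_fpr = (1.0 - spec_lenient) - (1.0 - spec_strict)
    return delta_tpr / delta_fpr


# Hypothetical segment parallel to the line x = y: slope 1, so LR = 1.
flat_segment = interval_lr((0.20, 0.90), (0.50, 0.60))

# Hypothetical steep segment near the origin: slope 2, so LR = 2.
steep_segment = interval_lr((0.00, 1.00), (0.20, 0.90))

print(f"LR on parallel segment: {flat_segment:.2f}")
print(f"LR on steep segment:    {steep_segment:.2f}")
```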


Table XV. Performance of basal estradiol in the prediction of non-pregnancy in IVF patients and shift from pre-test to post-test probability of non-pregnancy for patients with an abnormal estradiol result

 


Figure 11. Estimated ROC curve and sensitivity–specificity points for all studies reporting on the performance of basal estradiol in the prediction of non-pregnancy. Studies reporting on several threshold points are represented by an equivalent number of sens–spec points. N in the legend refers to the number of cycles studied, which in some studies is equivalent to the number of couples treated.

 
Clinical value
Based on the two summary ROC curves for all studies depicted in Figures 10 and 11, a range of positive LRs was calculated, and for each ratio the pre-estradiol-test probabilities of poor response or non-pregnancy (20 and 80%, respectively) were converted into post-estradiol-test probabilities. Table XVI (please see addendum) shows the probability of obtaining a certain estradiol test result and the corresponding LR within different LR ranges for the prediction of poor response and non-pregnancy. At a moderate LR of 4–5, the post-estradiol-test probability of poor response will not be higher than ~50%, while the chance of obtaining such a test result is very small.


Table XVI. The occurrence of the basal estradiol results within a specified likelihood ratio (LR) range and the concomitant post-test probabilities of poor response and non-pregnancy, given a prevalence of poor response of 20% and non-pregnancy of 80%

 
For prediction of non-pregnancy, no clear threshold levels can be identified for basal estradiol that will lead to an adequate combination of LR, post-test probability and abnormal test rate. This could be anticipated from the shape of the ROC curve in Figure 11.

All this leads to the conclusion that the clinical applicability of basal estradiol as a test before starting IVF is precluded by its very low predictive accuracy for both poor response and non-pregnancy.

AFC

Systematic review
The search and selection strategy identified 15 studies that reported on the predictive capacity of basal AFC and were suitable for data extraction and meta-analysis (Chang et al., 1998b; Frattarelli et al., 2000; Ng et al., 2000; Sharara and McClamrock, 2000; Hsieh et al., 2001; Nahum et al., 2001; Bancsi et al., 2002a; Erdem et al., 2002; Fisch and Sher, 2002; Fiçicioglu et al., 2003; Frattarelli et al., 2003; Jarvela et al., 2003; Kupesic et al., 2003; Yong et al., 2003; Durmusoglu et al., 2004). Characteristics of the included studies are listed in Table XVII (see addendum). The definitions of poor response and the study quality and design characteristics clearly varied, but logistic regression analysis revealed that none of these items significantly affected the predictive performance of the test. Subgroup analysis was therefore not indicated.


Table XVII. Characteristics of included studies on basal AFC (computerized search using test-specific keywords antral follicle count or antral follicle number)

 
Accuracy of poor response prediction
The sensitivities and specificities, the positive LR and the DOR for the prediction of poor ovarian response, as calculated from each study, are summarized in Table XVIII. Calculating a single summary point estimate for sensitivity and specificity was not meaningful, as both test characteristics, as plotted in Figure 12, were heterogeneous among studies (χ²-test: P = 0.001 for sensitivity and P = 0.001 for specificity). The Spearman correlation coefficient between sensitivity and specificity was –0.57 and was judged to be sufficient to estimate a summary ROC curve (Figure 12).


Table XVIII. Performance of basal AFC in the prediction of poor response in IVF patients and shift from pre-test to post-test probability of poor response for patients with an abnormal AFC result

 


Figure 12. Estimated ROC curve and sensitivity–specificity points for all studies reporting on the performance of the AFC in the prediction