L-Galaxies, Munich Galaxy Formation Model

MCMC Sampling

  1. Compiling & Running
  2. In order to perform the MCMC sampling the code should be compiled using My_Makefile_options_MCMC. To do so, in Makefile, replace "My_Makefile_options" with: "My_Makefile_options_MCMC" in lines 23 and 40 of Makefile. Then simply run the executable file generated using an mcmc input.par file as a run time argument. MCMC input files have "mcmc" in the name and follow the same naming convention as other input files.

    Two input files are currently available:

    • ./input/MCMC_inputs/input_Henriques15_mcmc_MR_W1_PLANCK.par: mcmc sampling for the standard Henriques2015 model using a sample of trees from the Millennium simulation.
    • ./input/MCMC_inputs/input_Henriques15_mcmc_MR_plus_MRII_W1_PLANCK.par: mcmc sampling for the standard Henriques2015 model using a sample of trees from the Millennium and Millennium II simulations. This allows the model to be constrained over a much broader dynamical range.

    There are also bash scripts available to run the code in parallel, in ./AuxCode/Run/, e.g.: cosma.bash. After the burn out phase it is equivalent to have a single MCMC chain with a lot of steps, or a lot of chains with a few steps. Since the galaxy formation model takes a relatively long time to run (from 5 to 10 minutes) it is better to run smaller chains in multiple cores. Typically, one can get convergence after 2 days on 100 cores for 10 parameters.

    A fundamental difference between running the code in 'MCMC' or 'normal' mode is the output produced. In 'normal mode', the code runs once in each dark matter sub-volume and one galaxy catalogue is produced per sub-volume. In 'MCMC mode', the code runs in a representative set of dark matter merger trees, for a large number of times, and the output is a single file (per processor), with a list of parameter values and a corresponding likelihood. If the code is not run in parallel, a file is written in the output directory with the comparison between the model and each observational constraint. This can be used for debugging.

  3. Quick Guide to include more Parameters or Observational Constraints:
  4. If you are just starting to use the MCMC you'll likely want to try to constrain different parameters in the model with different observational constrains. It is recommended that you start with what is currently available and from which you can easily choose by using the parameters switches at ./input/MCMC_inputs/MCMCParameterPriorsAndSwitches.txt, the observational constraints switches at ./input/MCMC_inputs/MCMCObsConstraints.txt and the observational constraints weights at ./input/MCMC_inputs/MCMCWeightsObsConstraints.txt. All these are described in the next section and allow you to select any combination of parameters to be sampled, and any combination of pre-defined constraints, with different weights, to be used at any redshifts.

  5. Makefile & Input Options
  6. 3.1 Makefile options:

    Makefile options for the MCMC are simple. Once the Makefile has been changed to compile using My_Makefile_options_MCMC that has OPT += -DMCMC on. That, in combination with the right input file, switches the code into MCMC mode. In addition, there is the option to use a sample both with Millennium and MillenniumII merger trees: OPT += -DMR_PLUS_MRII (which requires the appropriate input file).

    When OPT += -DMCMC is selected an additional OPT += -DHALOMODEL option is available. This will compute satellite profiles using a halo model, allowing the correlation function of galaxies to be accurately calculated from a very small subset of dark matter haloes. This makes it possible to use clustering measurements as observational constraints in the MCMC sampling. Further details are given in section Halo Model below.

    3.2 MCMC input files:

    A large number of additional inputs are needed in MCMC mode.

    3.2.1 Parameters to be sampled:

    • MCMCParameterPriorsAndSwitches: This file list the parameters sampled in the MCMC. It contains priors and a switch for each parameter.
    • A parameter is only used in the sampling if the switch is turned to 1. For parameters with switch=1, the values are not read from input_mcmc_***.par.
    • MCMCStartingParFile: File containing starting values for all parameters to be sampled. Parameters are only read from this file if the output from previous runs is not available at ./output/senna_g*_**.txt . The number of values in this file must correspond to the numbers of parameters with switch=1.
    • If there is a previous output available at ./output/senna_g*_**.txt the starting values are read from these files and set to the last available step. Therefore, when re-starting in mcmc mode, the code continues from where it stopped (MCMC_Initial_Par_DisplacementMCMC_LogStep_Size=0 when re-starting).

    3.2.2 Observational constraints:

    • MCMCObsConstraints: file containing the names of observational constraints to be used.
    • The first number in the file is the maximum number of observational constraints available (it must be the same as MCMCNConstraints in mcmc_vars.h). Then, there is the number of redshifts for which constraints are available (must be smaller than NOUT) and the values for those redshifts (these must correspond to the outputs of the galaxy formation model listed in ./input/MCMC_inputs/desired_output_redshifts_mcmc.txt). For each observational constraint a type of test is listed, in addition to a switch for each redshift available.

    • MCMCWeightsObsConstraints: file specifying a weight for each observational constraint at each redshift.
    • ObsConstraintsDir: folder containing the data for the observational constraints to be used.
    • There must a file in this directory corresponding to each constraint and redshift listed in MCMCObsConstraints, e.g. StellarMassFunction_z0.10.txt. The files contain binned values in standard format: x_bin_low, x_bin_high, y_value, err_y_value, with the first line in each file containing the number of bins.

    3.2.3 General inputs:

    • ChainLength: number of steps for each chain.
    • Time_Dependent_PhysPar: switch to allow parameters values to vary between desired output redshifts.
    • MCMCMode: 0 for normal MCMC, 1 to only accept parameters sets that give higher likelihood and find the maximum quicker (likely to get stuck in a local maximum if a small number of chains is used).
    • MCMC_LogStep_Size : size of the log_normal step. Regulate in order to get a final acceptance rate between 10-40%.
    • MCMC_Initial_Par_DisplacementMCMC_LogStep_Size: initial displacement to be applied to all parameters in log_normal space. Ideally large to start each chain in very different places. In practice it cannot be too large as it will cause a lot of the chains to wonder to zero likelihood regions.
    • MCMC_Minimum_Obs_Error: force a minimum error on observational constraints (a value of 10% is probably a good idea).

    3.2.4 Dark matter merger trees:

    In MCMC mode, in order to run the galaxy formation model in a reasonable time, a representative set of dark matter merger trees is used instead of the full volume of the N-body simulations. Therefore, there is a dark matter tree file containing the set of pre-selected trees (see Appendix B of Henriques et al., 2013, MNRAS, 431, 3373) and an MCMCSampleFile containing the IDs and weights of the FOF groups to be selected from those trees.

    Name of sample file containing IDs and weights: 'MCMCSampleFilePrefix'_sample_allz_nh_'MCMCSampleFile'snapnum.txt
    • MCMCSampleDir: directory for sample file
    • MCMCSampleFilePrefix: prefix of sample file
    • MCMCSampleFile: number of trees in sample file
    • MCMCTreeSampleFile: name of file containing the representative sample of dark matter merger trees.

  7. Analysing the Output
  8. For each core, an output file is written into the output directory: senna_gNN_ii.txt. Where NN is the last character of the route directory of the model and ii is the core number. In each of this files, one line will be printed per MCMC step containing: \(weight (\rm{normally}=1), -log_{10}(\rm{likelihood}), log_{10}(P1), log_{10}(P2), ..., log_{10}(Pn)\) .

    The output can be analysed using the public COSMO_MC package includes contains a number of very useful tests. Alternatively, mcmc_read_chains.pro has simple idl scripts to produce plots of 1D marginalised regions and of some parameter efficiencies.

    "sort -r -nk2 ./output/senna*" will sort all the output steps of the MCMC, giving the parameters for the highest likelihood value last.

  9. MCMC in depth
  10. 5.1 Including New Parameters:

    If you would like to include additional parameters into the sampling that is relatively straightforward. If the parameter was not present before in the semi-analytic model you will have to include it in the input_***.par file and read it in in read_parameters.c. In addition, you will have to pass the parameter value from the MCMC to the semi-analytic model in function void read_mcmc_par (int snapnum). Finally, you will have to include the new parameter in the input file ./input/MCMC_inputs/MCMCParameterPriorsAndSwitches.txt and increase the value for the number of parameters at the top.

    5.2 Including New Observational Constraints:

    Including a new observational constraint is slightly more complex and basically involves changes in the input files, mcmc_likelihood() and mcmc_save.c(). The first thing to do is to increase the number of observational constraints defined in mcmc_vars.h: "#define MCMCNConstraints" and at the top of ./input/MCMC_inputs/MCMCObsConstraints.txt and ./input/MCMC_inputs/MCMCWeightsObsConstraints.txt. Then, the new observational constraint(s) must be included in these two input files. A type of statistical test needs to be specified together with switches and weights for all output redshifts. Then a file with the same name specified in these files must be added to the directory ./ObsConstraints/ for all the redshifts to be used.

    After the inputs, mcmc_save.c() and mcmc_likelihood() must be changed. If a property not used before by the mcmc is needed, this must be saved in mcmc_save.c (section mcmc_save below). Then, a method of binning the theoretical predictions and a statistical test must be included in mcmc_likelihood(). Section mcmc_likelihood below explains how to use the functions already present in the code and how to include new ones.

    5.3 MCMC files in depth:

    The code related to the MCMC sampling is divided into 4 main files:

    • mcmc.c: file containing most of the functions that control the mcmc. These include a main function called senna() that controls the sampling and calls the galaxy formation model, sam(), and analyses the returned value of likelihood. It also contains functions to read information about the observational constrains and the dark matter sample of trees and to propose new values for the parameters at each step.
    • mcmc_likelihood.c: file containing the functions binning model galaxies and performing tests to compare the model with the different observational constraints.
    • mcmc_save.c: file containing the function that passes the properties needed to compare the model with observations from the structure inside the galaxy formation model to the structure used by the MCMC: MCMC_GAL (doing the necessary conversion of units). At the moment the MCMC works with units that include \(h\) factors.
    • mcmc_vars.h: all variables related to the mcmc are defined here, including the galaxy structure used by the mcmc: MCMC_GAL.

    In addition, mcmc_proto.h contains the headers for all the functions used in the MCMC.

    5.3.1 mcmc.c:

    mcmc.c contains most functions used by the MCMC sampling including the 'main' function for the method: Senna(). In MCMC mode, Senna() is the function controlling the entire code (instead of main.c used in normal mode).

    The first function to be called inside Senna() is initialize_mcmc_par_and_lhood(), which reads the initial values for all the parameters and a likelihood either from a previous output (in ./output/senna_g*_**.txt) or from MCMCStartingParFile. Next, read_observations() reads all the observational constraints into memory. initialize_mcmc_par_and_lhood() and read_observations() are read only once at the beginning. read_sample_info() reads information about the IDs and weights of the pre selected sample of FOF groups. In principle this could be read only once and that is in fact the case if only one simulation is used. If more than one simulation is used, currently if OPT += -DMR_PLUS_MRII, the function must be called twice at each MCMC step to switch between simulations.

    After the initialization has been done, the MCMC sampling starts. In a for loop that goes from 1 to ChainLength the semi-analytic model is run at each step, with galaxy properties being computed for a given set of parameters (passed from the mcmc to the semi-analytic model at read_mcmc_par()), and mcmc_likelihood() being called in the end to return a joint likelihood for the model at the current step.

    If the likelihood for the model at the current step is higher than at the previous step, the current set of parameters is accepted. If the likelihood is lower than in the previous step, the new set of parameter might still be accepted, with a probability given by the ratio of the two likelihoods. In any new case, a new set of parameters is proposed in propose_new_parameters() and passed into the semi-analytic model to re-start the process.

    5.3.2 mcmc_likelihood.c:

    mcmc_likelihood.c is the place where theoretical predictions are binned like observational functions and then compared with them by means of a statistical test that returns a likelihood. There are switches for the different observational constraints and the different redshifts that define if a given test is performed or not.

    There are currently three tests available: \(\chi^2\), maximum likelihood and binomial (although the binomial likelihood is not correctly calculated yet). You can easily include more tests in this function by looking at how the current ones are implemented.

    If you want to include additional constraints using the tests already provided you just need to bin the property you want to compare with observations in the correct way. If you want to compare a luminosity or mass function of a 'simple property', then you just need to make sure that property is saved in mcmc_save.c and use bin_function and the chi_square test. However in many cases you will need to include your own bin function. A detailed description of the tests can be found in Henriques et al., 2013, MNRAS, 431, 3373). For example, in the example already included for the stellar mass function of red and blue galaxies, it is necessary to divide galaxies into a red and blue population according to some criteria before binning.

    5.3.3 mcmc_save.c:

    This file is where the code checks if the current galaxy is in an FOF group that was pre-selected (the tree file contains all the FOFs in a tree, a bush). If that is the case galaxy properties are passed from HaloGal to MCMC_GAL. If you need additional properties, that are computed by the galaxy formation model but not listed here, to constrain the model with additional observations they must be passed between structures here. Additional h factors are included here. MCMC_GAL has galaxy properties for all the galaxies at all the output redshifts. The maximum number of galaxies much be pre-defined (MCMCAllocFactor) as it varies from redshift to redshift.

    5.3.4 mcmc_vars.h:

    This file contains the structure MCMC_GALAXY with all the galaxy properties available to the mcmc_likelihood() to compare with observational constraints. If you need additional properties, that are computed by the galaxy formation model but not listed here, to constrain the model with additional observations they must be included in this structure.

  11. Halo Model
  12. When OPT += -DHALOMODEL is switched on, HODs and satellite profiles are computed and used as input to a halo model, allowing the correlation function of all galaxies to be accurately estimated from a very small subset of dark matter haloes. This makes it possible to use clustering measurements as observational constraints in the MCMC sampling. The method is described in van Daalen et al., 2016.

    Besides MCMC mode, the halo model requires an additional line in the input file for MCMCHaloModelDir, which for the public version of the code should point to ./MCMC/HaloModel (see e.g. ./input/MCMC_inputs/input_Guo13_mcmc_halomodel_MR_W1_W1.par). This path contains the input power spectrum (powrealized_rebin_corrected.dat), the FoF mass function for M200mean masses (fofnum_m200mean.dat), and corrections to the power spectrum to account for halo triaxiality and alignments (ellip_corr.dat and align_corr.dat).

    The provided FoF mass function and input power spectrum are based on WMAP1, and are scaled with cosmology inside the code. If your run is not based on the WMAP1 Millennium simulation, these files should be replaced with your own. The corrections for halo shape and alignment are assumed to be independent of cosmology.

    As implemented here, the clustering estimator is not intended to be used in combination with Millennium II (MRII). To use it in accordance with such a higher-resolution simulation, the FoF mass function should be updated, the number of massbins should be increased (up from massbins=65), and the minimum FoF mass considered should be decreased (down from minfofmass=9.5). The latter two parameters can be found in mcmc_vars.h. The minimum FoF mass should be such that the number of FoF groups in the first bin is zero.

    When using the clustering estimator, we advise also setting OPT += -DPROJLIMITS, which calculates the projected correlation function in a way that is more consistent with observations. Additional options are OPT += -DOUTPUTPOW, OPT += -DOUTPUTCORR and OPT += -DOUTPUTPROJ, to output the sample-estimated power spectra, correlation functions and projected correlation functions, respectively.

    Finally, note that the clustering estimator assumes redshift zero. If the observational clustering constraints used are at higher redshift, the FoF mass function at that redshift should be provided, and a growth factor correction for the cosmological parameter Sigma8 should be included. Additional adjustments (such as changes to the halo shape correction) may also prove necessary.

    Files used by HALOMODEL:

    • mcmc_halomodel.c: Contains all the functions of the clustering estimator, including those used to calculate the HOD, satellite profile and resulting clustering functions.
    • mcmc_halomodel.h: Contains definitions for all functions in mcmc_halomodel.c.
    • mcmc_mpfit.c: MPFIT routine, based on MINPACK-1, taken directly from Craig Markwardt's page.
    • mcmc_mpfit.h: The header for MPFIT, also taken directly from the above page.

    6.1 mcmc_likelihood.c:

    The first function called is void initialize_halomodel(), which reads the input power spectrum, the FoF mass function and the halo shape and alignments corrections, and spline-interpolates all of these. This function is called only once.

    At each step of the MCMC chain, calls to the estimator's main function, void halomodel(double* r,double* proj,float masslimit_low,float masslimit_high,int snapnum), are made. After each call, proj will contain the estimated projected correlation function at radii r (in units of Mpc) for galaxies with stellar masses between masslimit_low and masslimit_high (in units of M). This function first calls void init_numgal(float masslimit_low,float masslimit_high,int snapnum) to calculate both the HOD and satellite profiles based on the galaxies in the sample haloes. Both are calculated in bins of halo mass. The HOD is spline-fitted, while the satellite profile is fit to with a function defined in van Daalen et al., 2016 (equation 12).

    Next, for each wavenumber k the 1-halo and 2-halo terms of the galaxy power spectrum are calculated, using calls to double Mcensat(double k,int haloterm,int censat) which in turns uses the HOD and satellite profiles of the galaxies (considering central and satellite galaxies separately). The resulting power spectrum is corrected for halo shape and alignments, and then given an additional correction for halo exclusion.

    Finally, the galaxy power spectrum is spline-fitted and convolved to produce the galaxy correlation function, which is then convolved to produce the projected galaxy correlation function.