Title: | Random Network Model Estimation, Selection and Parameter Tuning |
---|---|
Description: | Model selection and parameter tuning procedures for a class of random network models. The model selection can be done by a general cross-validation framework called ECV from Li et. al. (2016) <arXiv:1612.04717> . Several other model-based and task-specific methods are also included, such as NCV from Chen and Lei (2016) <arXiv:1411.1715>, likelihood ratio method from Wang and Bickel (2015) <arXiv:1502.02069>, spectral methods from Le and Levina (2015) <arXiv:1507.00827>. Many network analysis methods are also implemented, such as the regularized spectral clustering (Amini et. al. 2013 <doi:10.1214/13-AOS1138>) and its degree corrected version and graphon neighborhood smoothing (Zhang et. al. 2015 <arXiv:1509.08588>). It also includes the consensus clustering of Gao et. al. (2014) <arXiv:1410.5837>, the method of moments estimation of nomination SBM of Li et. al. (2020) <arxiv:2008.03652>, and the network mixing method of Li and Le (2021) <arxiv:2106.02803>. It also includes the informative core-periphery data processing method of Miao and Li (2021) <arXiv:2101.06388>. The work to build and improve this package is partially supported by the NSF grants DMS-2015298 and DMS-2015134. |
Authors: | Tianxi Li [aut, cre], Elizeveta Levina [aut], Ji Zhu [aut], Can M. Le [aut] |
Maintainer: | Tianxi Li <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.7 |
Built: | 2024-11-14 03:43:43 UTC |
Source: | https://github.com/cran/randnet |
The package provides model fitting and estimation functions for some popular random network models. More importantly, it implements a general cross-validation framework for model selection and parameter tuning called ECV. Several other model selection methods are also included. The work to build and improve this package is partially supported by the NSF grants DMS-2015298 and DMS-2015134.
Package: | randnet |
Type: | Package |
Version: | 0.7 |
Date: | 2023-05-11 |
License: | GPL (>= 2) |
Tianxi Li, Elizaveta Levina, Ji Zhu, Can M. Le
Maintainer: Tianxi Li <[email protected]>
T. Li, E. Levina, and J. Zhu. Network cross-validation by edge sampling. Biometrika, 107(2), pp.257-276, 2020.
K. Chen and J. Lei. Network cross-validation for determining the number of communities in network data. Journal of the American Statistical Association, 113(521):241-251, 2018.
K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, pages 1878-1915, 2011.
A. A. Amini, A. Chen, P. J. Bickel, and E. Levina. Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics, 41(4):2097-2122, 2013.
Qin, T. & Rohe, K. Regularized spectral clustering under the degree-corrected stochastic blockmodel Advances in Neural Information Processing Systems, 2013, 3120-3128
J. Lei and A. Rinaldo. Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43(1):215-237, 2014.
C. M. Le, E. Levina, and R. Vershynin. Concentration and regularization of random graphs. Random Structures & Algorithms, 2017.
S. J. Young and E. R. Scheinerman. Random dot product graph models for social networks. In International Workshop on Algorithms and Models for the Web-Graph, pages 138-149. Springer, 2007.
C. M. Le and E. Levina. Estimating the number of communities in networks by spectral methods. arXiv preprint arXiv:1507.00827, 2015.
Zhang, Y.; Levina, E. & Zhu, J. Estimating network edge probabilities by neighbourhood smoothing Biometrika, Oxford University Press, 2017, 104, 771-783
B. Karrer and M. E. Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83(1):016107, 2011.
Wang, Y. R. & Bickel, P. J. Likelihood-based model selection for stochastic block models The Annals of Statistics, Institute of Mathematical Statistics, 2017, 45, 500-528
Gao, C.; Ma, Z.; Zhang, A. Y. & Zhou, H. H. Achieving optimal misclassification proportion in stochastic block models The Journal of Machine Learning Research, JMLR. org, 2017, 18, 1980-2024
T. Li, E. Levina, and J. Zhu. Community models for networks observed through edge nominations. arXiv preprint arXiv:2008.03652 (2020).
T. Li and C. M. Le, Network Estimation by Mixing: Adaptivity and More. arXiv preprint arXiv:2106.02803, 2021.
R. Miao and T. Li. Informative core identification in complex networks. arXiv preprint arXiv:2101.06388, 2021
Sischka, B. and Kauermann, G., 2022. EM-based smooth graphon estimation using MCMC and spline-based approaches. Social Networks, 68, pp.279-295.
Estimates the number of communities under block models by using the spectral properties of network Beth-Hessian matrix with moment correction.
BHMC.estimate(A, K.max = 15)
BHMC.estimate(A, K.max = 15)
A |
adjacency matrix of the network |
K.max |
the maximum possible number of communities to check |
Note that the method cannot distinguish SBM and DCSBM. But it works under either model.
A list of result
K |
Estimated K |
values |
eigenvalues of the Beth-Hessian matrix |
Tianxi Li, Elizaveta Levina, Ji Zhu
Maintainer: Tianxi Li [email protected]
C. M. Le and E. Levina. Estimating the number of communities in networks by spectral methods. arXiv preprint arXiv:1507.00827, 2015.
LRBIC
,ECV.Block
, NCV.select
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE) A <- dt$A bhmc <- BHMC.estimate(A,15) bhmc
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE) A <- dt$A bhmc <- BHMC.estimate(A,15) bhmc
Generates networks from degree corrected stochastic block model, with various options for node degree distribution
BlockModel.Gen(lambda, n, beta = 0, K = 3, w = rep(1, K), Pi = rep(1, K)/K, rho = 0, simple = TRUE, power = TRUE, alpha = 5, degree.seed = NULL)
BlockModel.Gen(lambda, n, beta = 0, K = 3, w = rep(1, K), Pi = rep(1, K)/K, rho = 0, simple = TRUE, power = TRUE, alpha = 5, degree.seed = NULL)
lambda |
average node degree |
n |
size of network |
beta |
out-in ratio: the ratio of between-block edges over within-block edges |
K |
number of communities |
w |
not effective |
Pi |
a vector of community proportion |
rho |
proportion of small degrees within each community if the degrees are from two point mass disbribution. rho >0 gives degree corrected block model. If rho > 0 and simple=TRUE, then generate the degrees from two point mass distribution, with rho porition of 0.2 values and 1-rho proportion of 1 for degree parameters. If rho=0, generate from SBM. |
simple |
Indicator of wether two point mass degrees are used, if rho > 0. If rho=0, this is not effective |
power |
Whether or not use powerlaw distribution for degrees. If FALSE, generate from theta from U(0.2,1); if TRUE, generate theta from powerlaw. Only effective if rho >0, simple=FALSE. |
alpha |
Shape parameter for powerlaw distribution. |
degree.seed |
Can be a vector of a prespecified values for theta. Then the function will do sampling with replacement from the vector to generate theta. It can be used to control noise level between different configuration settings. |
A list of
A |
the generated network adjacency matrix |
g |
community membership |
P |
probability matrix of the network |
theta |
node degree parameter |
Tianxi Li, Elizaveta Levina, Ji Zhu
Maintainer: Tianxi Li [email protected]
B. Karrer and M. E. Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83(1):016107, 2011.
A. A. Amini, A. Chen, P. J. Bickel, and E. Levina. Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics, 41(4):2097-2122, 2013.
T. Li, E. Levina, and J. Zhu. Network cross-validation by edge sampling. Biometrika, 107(2), pp.257-276, 2020.
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE)
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE)
community detection by concensus (majority voting) initialized by regularized spectral clustering
ConsensusClust(A,K,tau=0.25,lap=TRUE)
ConsensusClust(A,K,tau=0.25,lap=TRUE)
A |
adjacency matrix |
K |
number of communities |
tau |
reguarlization parameter for regularized spectral clustering. Default value is 0.25. Typically set between 0 and 1. If tau=0, no regularization is applied. |
lap |
indicator. If TRUE, the Laplacian matrix for initializing clustering. If FALSE, the adjacency matrix will be used. |
Community detection algorithm by majority voting algorithm of Gao et. al. (2016). When initialized by regularized spectral clustering, it is shown that the clustering accuracy of this algorithm gives minimax rate under the SBM. However, it can slow compared with spectral clustering.
cluster labels
Tianxi Li, Elizaveta Levina, Ji Zhu
Maintainer: Tianxi Li <[email protected]>
Gao, C.; Ma, Z.; Zhang, A. Y. & Zhou, H. H. Achieving optimal misclassification proportion in stochastic block models The Journal of Machine Learning Research, JMLR. org, 2017, 18, 1980-2024
dt <- BlockModel.Gen(15,50,K=2,beta=0.2,rho=0) A <- dt$A cc <- ConsensusClust(A,K=2,lap=TRUE)
dt <- BlockModel.Gen(15,50,K=2,beta=0.2,rho=0) A <- dt$A cc <- ConsensusClust(A,K=2,lap=TRUE)
Estimates DCSBM model by given community labels
DCSBM.estimate(A, g)
DCSBM.estimate(A, g)
A |
adjacency matrix |
g |
vector of community labels for the nodes |
Estimation is based on maximum likelhood.
A list object of
Phat |
estimated probability matrix |
B |
the B matrix with block connection probability, up to a scaling constant |
Psi |
vector of of degree parameter theta, up to a scaling constant |
Tianxi Li, Elizaveta Levina, Ji Zhu
Maintainer: Tianxi Li [email protected]
B. Karrer and M. E. Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83(1):016107, 2011.
SBM.estimate
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE) A <- dt$A ssc <- reg.SSP(A,K=3,lap=TRUE) est <- DCSBM.estimate(A,ssc$cluster)
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE) A <- dt$A ssc <- reg.SSP(A,K=3,lap=TRUE) est <- DCSBM.estimate(A,ssc$cluster)
Model selection by ECV for SBM and DCSBM. It can be used to select between the two models or given on model (either SBM or DCSBM) and select K.
ECV.block(A, max.K, cv = NULL, B = 3, holdout.p = 0.1, tau = 0, dc.est = 2, kappa = NULL)
ECV.block(A, max.K, cv = NULL, B = 3, holdout.p = 0.1, tau = 0, dc.est = 2, kappa = NULL)
A |
adjacency matrix |
max.K |
largest possible K for number of communities |
cv |
cross validation fold. The default value is NULL. We recommend to use the argument B instead, doing indpendent sampling. |
B |
number of replications |
holdout.p |
testing set proportion |
tau |
constant for numerical stability only. Not useful for current version. |
dc.est |
estimation method for DCSBM. By defaulty (dc.est=2), the maximum likelihood is used. If dc.est=1, the method used by Chen and Lei (2016) is used, which is less stable according to our observation. |
kappa |
constant for numerical stability only. Not useful for current version. |
The current version is based on a simple matrix completion procedure, as described in the paper. The performance can be improved by better matrix completion method that will be implemented in next version. Moreover, the current implementation is better in computational time but less efficient in memory. Another memory efficient implementation will be added in next version.
impute.err |
average validaiton imputation error |
l2 |
average validation L_2 loss under SBM |
dev |
average validation binomial deviance loss under SBM |
auc |
average validation AUC |
dc.l2 |
average validation L_2 loss under DCSBM |
dc.dev |
average validation binomial deviance loss under DCSBM |
sse |
average validation SSE |
l2.model |
selected model by L_2 loss |
dev.model |
selected model by binomial deviance loss |
l2.mat , dc.l2.mat , ...
|
cross-validation loss matrix for B replications |
Tianxi Li, Elizaveta Levina, Ji Zhu
Maintainer: Tianxi Li [email protected]
T. Li, E. Levina, and J. Zhu. Network cross-validation by edge sampling. Biometrika, 107(2), pp.257-276, 2020.
NCV.select
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE) A <- dt$A ecv <- ECV.block(A,6,B=3) ecv$l2.model ecv$dev.model which.min(ecv$l2) which.min(ecv$dev) which.min(ecv$dc.l2) which.min(ecv$dc.dev) which.max(ecv$auc) which.min(ecv$sse)
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE) A <- dt$A ecv <- ECV.block(A,6,B=3) ecv$l2.model ecv$dev.model which.min(ecv$l2) which.min(ecv$dev) which.min(ecv$dc.l2) which.min(ecv$dc.dev) which.max(ecv$auc) which.min(ecv$sse)
selecting tuning parameter for neighborhood smoothing estimation of graphon model where the tuning parameter is to control estimation smoothness.
ECV.nSmooth.lowrank(A, h.seq, K, cv = NULL, B = 3, holdout.p = 0.1)
ECV.nSmooth.lowrank(A, h.seq, K, cv = NULL, B = 3, holdout.p = 0.1)
A |
adjacency matrix |
h.seq |
a sequence of h values to tune. It is suggested h should be in the order of sqrt(log(n)/n). |
K |
the optimal rank for approximation. Can be obtained by rank selection of ECV. |
cv |
cross-validation fold. Recomend to use replication number B instead. |
B |
independent replication number of random splitting |
holdout.p |
proportion of test sample |
The neighborhood smoothing estimation can be slow, so the ECV may take long even for moderately large networks.
a list object with
err |
average validation error for h.seq |
min.index |
index of the minimum error |
Tianxi Li, Elizaveta Levina, Ji Zhu
Maintainer: Tianxi Li [email protected]
T. Li, E. Levina, and J. Zhu. Network cross-validation by edge sampling. Biometrika, 107(2), pp.257-276, 2020.
set.seed(500) N <- 300 U = matrix(1:N,nrow=1) / (N+1) V = matrix(1:N,nrow=1) / (N+1) W = (t(U))^2 W = W/3*cos(1/(W + 1e-7)) + 0.15 upper.index <- which(upper.tri(W)) A <- matrix(0,N,N) rand.ind <- runif(length(upper.index)) edge.index <- upper.index[rand.ind < W[upper.index]] A[edge.index] <- 1 A <- A + t(A) diag(A) <- 0 h.seq <- sqrt(log(N)/N)*seq(0.5,5,by=0.5) ecv.nsmooth <- ECV.nSmooth.lowrank(A,h.seq,K=2,B=3) h <- h.seq[ecv.nsmooth$min.index]
set.seed(500) N <- 300 U = matrix(1:N,nrow=1) / (N+1) V = matrix(1:N,nrow=1) / (N+1) W = (t(U))^2 W = W/3*cos(1/(W + 1e-7)) + 0.15 upper.index <- which(upper.tri(W)) A <- matrix(0,N,N) rand.ind <- runif(length(upper.index)) edge.index <- upper.index[rand.ind < W[upper.index]] A[edge.index] <- 1 A <- A + t(A) diag(A) <- 0 h.seq <- sqrt(log(N)/N)*seq(0.5,5,by=0.5) ecv.nsmooth <- ECV.nSmooth.lowrank(A,h.seq,K=2,B=3) h <- h.seq[ecv.nsmooth$min.index]
estimates the optimal low rank model for a network
ECV.Rank(A, max.K, B = 3, holdout.p = 0.1, weighted = TRUE,mode="directed")
ECV.Rank(A, max.K, B = 3, holdout.p = 0.1, weighted = TRUE,mode="directed")
A |
adjacency matrix |
max.K |
maximum possible rank to check |
B |
number of replications in ECV |
holdout.p |
test set proportion |
weighted |
whether the network is weighted. If TRUE, only sum of squared errors are computed. If FALSE, then treat the network as binary and AUC will be computed along with SSE. |
mode |
Selectign the mode of "directed" or "undirected" for cross-validation. |
AUC is believed to be more accurate in many simulations for binary networks. But the computation of AUC is much slower than SSE, even slower than matrix completion steps.
Note that we do not have to assume the true model is low rank. This function simply finds a best low-rank approximation to the true model.
A list of
sse.rank |
rank selection by SSE loss |
auc.rank |
rank selection by AUC loss |
auc |
auc sequence for each rank candidate |
sse |
sse sequence for each rank candidate |
Tianxi Li, Elizaveta Levina, Ji Zhu
Maintainer: Tianxi Li [email protected]
T. Li, E. Levina, and J. Zhu. Network cross-validation by edge sampling. Biometrika, 107(2), pp.257-276, 2020.
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE) A <- dt$A ecv.rank <- ECV.Rank(A,6,weighted=FALSE,mode="undirected") ecv.rank
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE) A <- dt$A ecv.rank <- ECV.Rank(A,6,weighted=FALSE,mode="undirected") ecv.rank
identify the informative core component of a network based on the spectral method of Miao and Li (2021). It can be used as a general data processing function for any network modeling purpose.
InformativeCore(A,r=3)
InformativeCore(A,r=3)
A |
adjacency matrix. It does not have to be unweighted. |
r |
the rank for low-rank denoising. The rank can be selected by ECV or any other methods availale in the package. |
The function can be used as a general data processing function for any network modeling purpose. It automatically identify an informative core component with interesting connection structure and a noninformative periphery component with uninterestings structures. Depending on the user's preference, the uninteresting structure can be either the Erdos-Renyi type connections or configuration type of connections, both of which are generally regarded as noninformative structures. Including these additional non-informative structures in network models can potentially lower the modeling efficiency. Therefore, it is preferable to remove them and only focus on the core structure. Details can be found in the reference.
A list of
er.score |
A n dimensional vector of informative scores for ER-type periphery. A larger score indicates the corresponding node is more likely to be in the core. |
config.score |
A n dimensional vector of informative scores for configuration-type periphery. A larger score indicates the corresponding node is more likely to be in the core. |
er.theory.core |
The indices of identified core structure in the ER-type model based on a theoretical threshold of the scores (for large sample size). |
config.theory.core |
The indices of identified core structure in the configuration-type model based on a theoretical threshold of the scores (for large sample size). |
er.kmeans.core |
The indices of identified core structure in the ER-type model based on kmeans clustering of the scores. |
config.kmeans.core |
The indices of identified core structure in the configuration-type model based on kmeans clustering of the scores (for large sample size). |
Tianxi Li, Elizaveta Levina, Ji Zhu, Can M. Le
Maintainer: Tianxi Li [email protected]
R. Miao and T. Li. Informative core identification in complex networks. arXiv preprint arXiv:2101.06388, 2021
set.seed(100) dt <- BlockModel.Gen(60,1000,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE) ### this is not an interesting case -- only for demonstration of the usage. ### The network has no periphery nodes, all nodes are in the core. A <- dt$A core.fit <- InformativeCore(A,r=3) length(core.fit$er.theory.core) ### essentially all nodes are selected as the core.
set.seed(100) dt <- BlockModel.Gen(60,1000,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE) ### this is not an interesting case -- only for demonstration of the usage. ### The network has no periphery nodes, all nodes are in the core. A <- dt$A core.fit <- InformativeCore(A,r=3) length(core.fit$er.theory.core) ### essentially all nodes are selected as the core.
selecting number of communities by asymptotic likelihood ratio based the methdo of Wang and Bickel 2015
LRBIC(A, Kmax, lambda = NULL, model = "both")
LRBIC(A, Kmax, lambda = NULL, model = "both")
A |
adjacency matrix |
Kmax |
the largest possible number of communities to check |
lambda |
a tuning parameter. By default, will use the number recommended in the paper. |
model |
selecting K under which model. If set to be "SBM", the calculation will be done under SBM. If set to be "DCSBM", the calculation will be done under DCSBM. The default value is "both" so will give two selections under SBM and DCSBM respectively. |
Note that the method cannot distinguish SBM and DCSBM, though different calculation is done under the two models. So it is not valid to compare across models. The theoretical analysis of the method is done under maximum likelhood and variational EM. But as suggested in the paper, we use spectral clustering for community detection before fitting maximum likelhood.
a list of
SBM.K |
estimated number of communities under SBM |
DCSBM.K |
estimated number of communities under DCSBM |
SBM.BIC |
the BIC values for the K sequence under SBM |
DCSBM.BIC |
the BIC values for the K sequence under DCSBM |
Tianxi Li, Elizaveta Levina, Ji Zhu
Maintainer: Tianxi Li [email protected]
Wang, Y. R. & Bickel, P. J. Likelihood-based model selection for stochastic block models The Annals of Statistics, Institute of Mathematical Statistics, 2017, 45, 500-528
BHMC.estimate
, ECV.block
, NCV.select
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE) A <- dt$A ### test LRBIC lrbic <- LRBIC(A,6,model="both") lrbic$SBM.K lrbic$DCSBM.K
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE) A <- dt$A ### test LRBIC lrbic <- LRBIC(A,6,model="both") lrbic$SBM.K lrbic$DCSBM.K
estimates inner product latent space model by projected gradient descent from the paper of Ma et al. (2020).
LSM.PGD(A, k,step.size=0.3,niter=500,trace=0)
LSM.PGD(A, k,step.size=0.3,niter=500,trace=0)
A |
adjacency matrix |
k |
the dimension of the latent position |
step.size |
step size of gradient descent |
niter |
maximum number of iterations |
trace |
if trace > 0, the objective will be printed out after each iteration |
The method is based on the gradient descent of Ma et al (2020), with initialization of the universal singular value thresholding as discussed there. The parameter identifiability constraint is the same as in the paper.
a list of
Z |
latent positions |
alpha |
individual parameter alpha as in the paper |
Phat |
esitmated probability matrix |
obj |
the objective of the gradient method |
Tianxi Li and Can M. Le
Maintainer: Tianxi Li [email protected]
Z. Ma, Z. Ma, and H. Yuan. Universal latent space model fitting for large networks with edge covariates. Journal of Machine Learning Research, 21(4):1-67, 2020.
dt <- RDPG.Gen(n=600,K=2,directed=TRUE) A <- dt$A fit <- LSM.PGD(A,2,niter=50)
dt <- RDPG.Gen(n=600,K=2,directed=TRUE) A <- dt$A fit <- LSM.PGD(A,2,niter=50)
selecting block models by NCV of Chen and Lei (2016)
NCV.select(A, max.K, cv = 3)
NCV.select(A, max.K, cv = 3)
A |
adjacency matrix |
max.K |
largest number of communities to check |
cv |
fold of cross-validation |
Spectral clustering is used for fitting the block models
a list of
dev |
the binomial deviance loss under SBM for each K |
l2 |
the L_2 loss under SBM for each K |
dc.dev |
the binomial deviance loss under DCSBM for each K |
dc.l2 |
the L_2 loss under DCSBM for each K |
dev.model |
the selected model by deviance loss |
l2.model |
the selected model by L_2 loss |
sbm.l2.mat , sbm.dev.mat , ....
|
the corresponding matrices of loss for each fold (row) and each K value (column) |
Tianxi Li, Elizaveta Levina, Ji Zhu
Maintainer: Tianxi Li [email protected]
Chen, K. & Lei, J. Network cross-validation for determining the number of communities in network data Journal of the American Statistical Association, Taylor & Francis, 2018, 113, 241-251
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE) A <- dt$A ncv <- NCV.select(A,6,3) ncv$l2.model ncv$dev.model which.min(ncv$dev) which.min(ncv$l2) which.min(ncv$dc.dev) which.min(ncv$dc.l2)
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE) A <- dt$A ncv <- NCV.select(A,6,3) ncv$l2.model ncv$dev.model which.min(ncv$dev) which.min(ncv$l2) which.min(ncv$dc.dev) which.min(ncv$dc.l2)
estimates network connection probability by network mixing of Li and Le (2021).
network.mixing(A,index=NULL,rho = 0.1,max.K=15,dcsbm=TRUE, usvt=TRUE,ns=FALSE, lsm=FALSE,lsm.k=4,trace=FALSE)
network.mixing(A,index=NULL,rho = 0.1,max.K=15,dcsbm=TRUE, usvt=TRUE,ns=FALSE, lsm=FALSE,lsm.k=4,trace=FALSE)
A |
adjacency matrix |
index |
a pre-specified hold-out set. If NULL, the set will be randomly generated according to rho. |
rho |
hold-out proportion as validation entries. Only effective when index is NULL. |
max.K |
the maximum number of blocks used for the block model approximation (see details). |
dcsbm |
whether to include the DCSBM as components, up to max.K. By default, the method will include it. |
usvt |
whether to include the USVT as a component. By default, the method will include it. |
ns |
whether to include the neighborhood smoothing as a component. |
lsm |
whether to include the gradient estimator of the latent space model as a component. |
lsm.k |
the dimension of the latent space. Only effective if lsm is TRUE. |
trace |
whether to print the model fitting progress. |
The basic version of the mixing estimator will include SBM and DCSBM estimates with the number of blocks from 1 to max.K. Users could also specify whether to include additional USVT, neighborhood smoothing and latent space model estimators. If NNL (non-negative linear), exponential, or ECV is used, the mixing is usually robust for a reasonable range of max.K and whether to include the other models. The linear mixing, however, is vulnerable for a large number of base estimates. The NNL is our recommended method. USVT is also recommended. the neighborhood smoothing and latent space model are slower, so are not suitable for large networks. Details can be found in Li and Le (2021).
a list of
linear.Phat |
estimated probability matrix by linear mixing |
linear.weight |
the weights of the indivdiual models in linear mixing |
nnl.Phat |
estimated probability matrix by NNL mixing |
nnl.weight |
the weights of the indivdiual models in NNL mixing |
exp.Phat |
estimated probability matrix by exponential mixing |
exp.weight |
the weights of the indivdiual models in exponential mixing |
ecv.Phat |
estimated probability matrix by ECV mixing (only one nonzero) |
ecv.weight |
the weights of the indivdiual models in ECV mixing (only one nonzero) |
model.names |
the names of all individual models, in the same order as the weights |
Tianxi Li and Can M. Le
Maintainer: Tianxi Li <[email protected]>
T. Li and C. M. Le, Network Estimation by Mixing: Adaptivity and More. arXiv preprint arXiv:2106.02803, 2021.
dt <- RDPG.Gen(n=500,K=5,directed=TRUE) A <- dt$A fit <- network.mixing(A) fit$model.names fit$nnl.weight
dt <- RDPG.Gen(n=500,K=5,directed=TRUE) A <- dt$A fit <- network.mixing(A) fit$model.names fit$nnl.weight
estimates network connection probability by network mixing of Li and Le (2021) with B-fold averaging.
network.mixing.Bfold(A,B=10,rho = 0.1,max.K=15,dcsbm=TRUE,usvt=TRUE,ns=FALSE, lsm=FALSE,lsm.k=4)
network.mixing.Bfold(A,B=10,rho = 0.1,max.K=15,dcsbm=TRUE,usvt=TRUE,ns=FALSE, lsm=FALSE,lsm.k=4)
A |
adjacency matrix |
B |
number of random replications to average over |
rho |
hold-out proportion as validation entries. Only effective when index is NULL. |
max.K |
the maximum number of blocks used for the block model approximation (see details). |
dcsbm |
whether to include the DCSBM as components, up to max.K. By default, the method will include it. |
usvt |
whether to include the USVT as a component. By default, the method will include it. |
ns |
whether to include the neighborhood smoothing as a component. |
lsm |
whether to include the gradient estimator of the latent space model as a component. |
lsm.k |
the dimension of the latent space. Only effective if lsm is TRUE. |
This is essentially the same procedure as the network.mixing, but repeat it B times and return the average. Use with cautious. Though it can make the estimate more stable, the procedure would increase the computational cost by a factor of B. Based on our limited experience, this is usually not a great trade-off as the improvement might be marginal.
a list of
linear.Phat |
estimated probability matrix by linear mixing |
nnl.Phat |
estimated probability matrix by NNL mixing |
exp.Phat |
estimated probability matrix by exponential mixing |
ecv.Phat |
estimated probability matrix by ECV mixing (only one nonzero) |
model.names |
the names of all individual models, in the same order as the weights |
Tianxi Li and Can M. Le
Maintainer: Tianxi Li <[email protected]>
T. Li and C. M. Le, Network Estimation by Mixing: Adaptivity and More. arXiv preprint arXiv:2106.02803, 2021.
dt <- RDPG.Gen(n=200,K=3,directed=TRUE) A <- dt$A fit <- network.mixing.Bfold(A,B=2)
dt <- RDPG.Gen(n=200,K=3,directed=TRUE) A <- dt$A fit <- network.mixing.Bfold(A,B=2)
calculates normalized mutual information, a metric that is commonly used to compare clustering results
NMI(g1, g2)
NMI(g1, g2)
g1 |
a vector of cluster labels |
g2 |
a vector of cluster labels (same length as g1) |
NMI value
Tianxi Li, Elizaveta Levina, Ji Zhu
Maintainer: Tianxi Li [email protected]
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE) A <- dt$A ssc <- reg.SSP(A,K=3,lap=TRUE) NMI(ssc$cluster,dt$g)
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE) A <- dt$A ssc <- reg.SSP(A,K=3,lap=TRUE) NMI(ssc$cluster,dt$g)
estimates NSBM parameters given community labels
NSBM.estimate(A,K,g,reg.bound=-Inf)
NSBM.estimate(A,K,g,reg.bound=-Inf)
A |
adjacency matrix of a directed where Aij = 1 iff i -> j |
K |
number of communities |
g |
a vector of community labels |
reg.bound |
the regularity lower bound of lambda value. By default, -Inf. That means, no constraints. When the network is sparse, using certain constaints may improve stability. |
The method of moments is used for estimating the edge nomination SBM, so the strategy can be used for both unweighted and weighted networks. The details can be found in Li et. al. (2020).
a list of
B |
estimated block connection probability matrix |
lambda |
estimated lambda values for nomination intensity |
theta |
estimated theta values for nomination preference |
P.tilde |
estimated composiste probability matrix after nomination |
g |
community labels |
Tianxi Li, Elizaveta Levina, Ji Zhu
Maintainer: Tianxi Li [email protected]
T. Li, E. Levina, and J. Zhu. Community models for networks observed through edge nominations. arXiv preprint arXiv:2008.03652 (2020).
dt <- NSBM.Gen(n=200,K=2,beta=0.2,avg.d=10) A <- dt$A sc <- RightSC(A,K=3) est <- NSBM.estimate(A,K=3,g=sc$cluster)
dt <- NSBM.Gen(n=200,K=2,beta=0.2,avg.d=10) A <- dt$A sc <- RightSC(A,K=3) est <- NSBM.estimate(A,K=3,g=sc$cluster)
Generates networks from nomination stochastic block model for community structure in edge nomination procedures, proposed in Li et. al. (2020)
NSBM.Gen( n, K, avg.d,beta,theta.low=0.1, theta.p=0.2,lambda.scale=0.2,lambda.exp=FALSE)
NSBM.Gen( n, K, avg.d,beta,theta.low=0.1, theta.p=0.2,lambda.scale=0.2,lambda.exp=FALSE)
n |
size of network |
K |
number of communities |
avg.d |
expected average degree of the resuling network (after edge nomination) |
beta |
the out-in ratio of the original SBM |
theta.low |
the lower value of theta's. The theta's are generated as two-point mass at theta.low and 1. |
theta.p |
proportion of lower value of theta's |
lambda.scale |
standard deviation of the lambda (before the exponential, see lambda.exp) |
lambda.exp |
If TRUE, lambda is generated as exponential of uniformation random randomes. Otherwise, they are normally distributed. |
A list of
A |
the generated network adjacency matrix |
g |
community membership |
P |
probability matrix of the orignal SBM network |
P.tilde |
probability matrix of the observed network after nomination |
B |
B parameter |
lambda |
lambda parameter |
theta |
theta parameter |
Tianxi Li, Elizaveta Levina, Ji Zhu
Maintainer: Tianxi Li [email protected]
T. Li, E. Levina, and J. Zhu. Community models for networks observed through edge nominations. arXiv preprint arXiv:2008.03652 (2020).
dt <- NSBM.Gen(n=200,K=2,beta=0.2,avg.d=10)
dt <- NSBM.Gen(n=200,K=2,beta=0.2,avg.d=10)
estimates probabilty matrix by neighborhood smoothing of Zhang et. al. (2017)
nSmooth(A, h = NULL)
nSmooth(A, h = NULL)
A |
adjacency matrix |
h |
quantile value used for smoothing. Recommended to be in the scale of sqrt(log(n)/n) where n is the size of the network. The default value is sqrt(log(n)/n) from the paper. |
The method assumes a graphon model where the underlying graphon function is piecewise Lipchitz. However, it may be slow for moderately large networks, though it is one of the fastest methods for graphon models.
the probability matrix
Tianxi Li, Elizaveta Levina, Ji Zhu
Maintainer: Tianxi Li <[email protected]>
Zhang, Y.; Levina, E. & Zhu, J. Estimating network edge probabilities by neighbourhood smoothing Biometrika, Oxford University Press, 2017, 104, 771-783
N <- 100 U = matrix(1:N,nrow=1) / (N+1) V = matrix(1:N,nrow=1) / (N+1) W = (t(U))^2 W = W/3*cos(1/(W + 1e-7)) + 0.15 upper.index <- which(upper.tri(W)) A <- matrix(0,N,N) rand.ind <- runif(length(upper.index)) edge.index <- upper.index[rand.ind < W[upper.index]] A[edge.index] <- 1 A <- A + t(A) diag(A) <- 0 What <- nSmooth(A)
N <- 100 U = matrix(1:N,nrow=1) / (N+1) V = matrix(1:N,nrow=1) / (N+1) W = (t(U))^2 W = W/3*cos(1/(W + 1e-7)) + 0.15 upper.index <- which(upper.tri(W)) A <- matrix(0,N,N) rand.ind <- runif(length(upper.index)) edge.index <- upper.index[rand.ind < W[upper.index]] A[edge.index] <- 1 A <- A + t(A) diag(A) <- 0 What <- nSmooth(A)
generates random networks from random dot product graph model
RDPG.Gen(n, K, directed = TRUE, avg.d = NULL)
RDPG.Gen(n, K, directed = TRUE, avg.d = NULL)
n |
size of the network |
K |
dimension of latent space |
directed |
whether the network is directed or not |
avg.d |
average node degree of the network (in expectation) |
The network is generated according to special formulation mentioned in ECV paper.
a list of
A |
the adjacency matrix |
P |
the probability matrix |
Tianxi Li, Elizaveta Levina, Ji Zhu
Maintainer: Tianxi Li [email protected]
S. J. Young and E. R. Scheinerman. Random dot product graph models for social networks. In International Workshop on Algorithms and Models for the Web-Graph, pages 138-149. Springer, 2007. T. Li, E. Levina, and J. Zhu. Network cross-validation by edge sampling. Biometrika, 107(2), pp.257-276, 2020.
dt <- RDPG.Gen(n=600,K=2,directed=TRUE) A <- dt$A
dt <- RDPG.Gen(n=600,K=2,directed=TRUE) A <- dt$A
community detection by regularized spectral clustering
reg.SP(A, K, tau = 1, lap = FALSE,nstart=30,iter.max=100)
reg.SP(A, K, tau = 1, lap = FALSE,nstart=30,iter.max=100)
A |
adjacency matrix |
K |
number of communities |
tau |
reguarlization parameter. Default value is one. Typically set between 0 and 1. If tau=0, no regularization is applied. |
lap |
indicator. If TRUE, the Laplacian matrix for clustering. If FALSE, the adjacency matrix will be used. |
nstart |
number of random initializations for K-means |
iter.max |
maximum number of iterations for K-means |
The regularlization is done by adding a small constant to each element of the adjacency matrix. It is shown by such perturbation helps concentration in sparse networks. It is shown to give consistent clustering under SBM.
a list of
cluster |
cluster labels |
loss |
the loss of Kmeans algorithm |
Tianxi Li, Elizaveta Levina, Ji Zhu
Maintainer: Tianxi Li <[email protected]>
K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, pages 1878-1915, 2011.
A. A. Amini, A. Chen, P. J. Bickel, and E. Levina. Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics, 41(4):2097-2122, 2013.
J. Lei and A. Rinaldo. Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43(1):215-237, 2014.
C. M. Le, E. Levina, and R. Vershynin. Concentration and regularization of random graphs. Random Structures & Algorithms, 2017.
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0) A <- dt$A sc <- reg.SP(A,K=3,lap=TRUE) NMI(sc$cluster,dt$g)
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0) A <- dt$A sc <- reg.SP(A,K=3,lap=TRUE) NMI(sc$cluster,dt$g)
community detection by regularized spherical spectral clustering
reg.SSP(A, K, tau = 1, lap = FALSE,nstart=30,iter.max=100)
reg.SSP(A, K, tau = 1, lap = FALSE,nstart=30,iter.max=100)
A |
adjacency matrix |
K |
number of communities |
tau |
reguarlization parameter. Default value is one. Typically set between 0 and 1. If tau=0, no regularization is applied. |
lap |
indicator. If TRUE, the Laplacian matrix for clustering. If FALSE, the adjacency matrix will be used. |
nstart |
number of random initializations for K-means |
iter.max |
maximum number of iterations for K-means |
The regularlization is done by adding a small constant to each element of the adjacency matrix. It is shown by such perturbation helps concentration in sparse networks. The difference from spectral clustering (reg.SP) comes from its extra step to normalize the rows of spectral vectors. It is proved that it gives consistent clustering under DCSBM.
a list of
cluster |
cluster labels |
loss |
the loss of Kmeans algorithm |
Tianxi Li, Elizaveta Levina, Ji Zhu
Maintainer: Tianxi Li <[email protected]>
T. Qin and K. Rohe. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In Advances in Neural Information Processing Systems, pages 3120-3128, 2013.
J. Lei and A. Rinaldo. Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43(1):215-237, 2014.
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE) A <- dt$A ssc <- reg.SSP(A,K=3,lap=TRUE) NMI(ssc$cluster,dt$g)
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0.9,simple=FALSE,power=TRUE) A <- dt$A ssc <- reg.SSP(A,K=3,lap=TRUE) NMI(ssc$cluster,dt$g)
community detection by regularized spectral clustering on right singular vectors
RightSC(A, K, normal = FALSE)
RightSC(A, K, normal = FALSE)
A |
adjacency matrix of a directed adjacecy matrix |
K |
number of communities |
normal |
indicator. If TRUE, normalization of singular vector rows will be applied, similar to the spectral spherical clustering. |
This is essentially the spectral clustering applied on right singular vectors. It can be used to handle directed networks where Aij = 1 if and only if i -> j, and the edges tend to have a missing issue specifically depending on the sender i. More details can be found in Li et. al. (2020).
a list of
cluster |
cluster labels |
loss |
the loss of Kmeans algorithm |
Tianxi Li, Elizaveta Levina, Ji Zhu
Maintainer: Tianxi Li <[email protected]>
T. Li, E. Levina, and J. Zhu. Community models for networks observed through edge nominations. arXiv preprint arXiv:2008.03652 (2020).
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0) A <- dt$A sc <- RightSC(A,K=2)
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0) A <- dt$A sc <- RightSC(A,K=2)
estimates SBM parameters given community labels
SBM.estimate(A, g)
SBM.estimate(A, g)
A |
adjacency matrix |
g |
a vector of community labels |
maximum likelhood is used
a list of
B |
estimated block connection probability matrix |
Phat |
estimated probability matrix |
g |
community labels |
Tianxi Li, Elizaveta Levina, Ji Zhu
Maintainer: Tianxi Li [email protected]
B. Karrer and M. E. Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83(1):016107, 2011.
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0) A <- dt$A sc <- reg.SP(A,K=3,lap=TRUE) sbm <- SBM.estimate(A,sc$cluster) sbm$B
dt <- BlockModel.Gen(30,300,K=3,beta=0.2,rho=0) A <- dt$A sc <- reg.SP(A,K=3,lap=TRUE) sbm <- SBM.estimate(A,sc$cluster) sbm$B
oracle smooth graphon estimation according to given latent positions, based on smooth estimation.
smooth.oracle(Us,A)
smooth.oracle(Us,A)
Us |
a vector whose elements are the latent positions of the network nodes under the graphon model. The dimension of the vector equals the number of nodes in the network. |
A |
adjacency matrix. It does not have to be unweighted. |
Note that the latenet positions are unknown in practice, so this estimation is an oracle estimation mainly for evaluation purpose. However, if the latenet positions can be approximated estimated, this function can also be used for estimating the edge probability matrix. This procedure is the M-step of the algorithm used in Sischka & Kauermann (2022). Our implementation is modified from the contribution of an anonymous reviewer, leveraging the tools of the sparseFLMM package.
The estimated probability matrix.
Tianxi Li, Elizaveta Levina, Ji Zhu, Can M. Le
Maintainer: Tianxi Li [email protected]
Sischka, B. and Kauermann, G., 2022. EM-based smooth graphon estimation using MCMC and spline-based approaches. Social Networks, 68, pp.279-295.
set.seed(100) dt <- BlockModel.Gen(10,50,K=2,beta=0.2) ## oracle order oracle.index <- sort(dt$g,index.return=TRUE)$ix A <- dt$A[oracle.index,oracle.index] oracle.est <- smooth.oracle(seq(0,1,length.out=50),A)
set.seed(100) dt <- BlockModel.Gen(10,50,K=2,beta=0.2) ## oracle order oracle.index <- sort(dt$g,index.return=TRUE)$ix A <- dt$A[oracle.index,oracle.index] oracle.est <- smooth.oracle(seq(0,1,length.out=50),A)
estimates the network probability matrix by the universal singular value thresholding of Chatterjee (2015), with the improvement mentioned in Zhang et. al. (2017).
USVT(A)
USVT(A)
A |
adjacency matrix |
Instead of using the original threshold in Chatterjee (2015), the estimate is generated by taking the n^(1/3) leading spectral components. The method was mentioned in Zhang et. al. (2017) and suggested by an anonymous reviewer.
The estimated probability matrix.
Tianxi Li and Can M. Le
Maintainer: Tianxi Li [email protected]
S. Chatterjee. Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177-214, 2015. Y. Zhang, E. Levina, and J. Zhu. Estimating network edge probabilities by neighbourhood smoothing. Biometrika, 104(4):771-783, 2017.
dt <- RDPG.Gen(n=600,K=2,directed=TRUE) A <- dt$A fit <- USVT(A)
dt <- RDPG.Gen(n=600,K=2,directed=TRUE) A <- dt$A fit <- USVT(A)