Co-authorship network with ground truth (for overlapping community detection)

We construct co-authorship networks from DBLP and Microsoft Academic Graph (MAG). Please see our paper for the citations to these two datasets.

For DBLP, each community is a group of conferences; for MAG, each community is denoted by a “field of study” (FOS) tag. Each author’s ground-truth community distribution (the \(\mathbf{\theta}\) vector) is constructed by normalizing the number of papers he/she has published in the conferences of each subfield (or, for MAG, the number of papers carrying each FOS tag). Please read our paper for details.
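The normalization described above can be sketched as follows. This is a minimal illustration, not the code used to build the datasets: the function name `membership_vector`, the community labels, and the example author are all hypothetical, and it assumes each paper counts once toward exactly one community.

```python
from collections import Counter

def membership_vector(paper_communities, communities):
    """Normalize per-community paper counts into a vector that sums to 1.

    paper_communities: one community label per paper of a single author.
    communities: the ordered list of community labels for the network.
    """
    counts = Counter(paper_communities)
    total = sum(counts[c] for c in communities)
    if total == 0:
        # Author has no papers in any listed community (assumed edge case).
        return [0.0] * len(communities)
    return [counts[c] / total for c in communities]

# Hypothetical author with three papers in "ML" and one in "DM".
communities = ["ML", "TCS", "DM"]
theta = membership_vector(["ML", "ML", "DM", "ML"], communities)
print(theta)  # [0.75, 0.0, 0.25]
```

An author who publishes in a single community gets a vector with one entry equal to 1, while an author active in several communities gets a mixed vector, which is what makes the ground truth overlapping.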

We also construct a bipartite version of the DBLP networks, where each node is either an author or a paper, and each edge connects an author to a paper. Please read our paper for details.
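To make the difference between the two graph types concrete, here is a small sketch that builds both edge sets from the same toy input. The paper and author names are made up, and the exact construction used for the released files (for example, any filtering or edge weighting) may differ; see the paper for the actual procedure.

```python
from itertools import combinations

# Hypothetical input: paper ID -> list of author names.
papers = {"p1": ["alice", "bob"], "p2": ["bob", "carol", "dave"]}

def bipartite_edges(paper_authors):
    """One (author, paper) edge for every authorship relation."""
    return sorted((a, p) for p, authors in paper_authors.items()
                  for a in set(authors))

def coauthor_edges(paper_authors):
    """One (author, author) edge for every pair sharing at least one paper."""
    edges = set()
    for authors in paper_authors.values():
        for u, v in combinations(sorted(set(authors)), 2):
            edges.add((u, v))
    return sorted(edges)

print(bipartite_edges(papers))
print(coauthor_edges(papers))
```

In the bipartite graph no two authors are directly connected; co-authorship is only visible through the shared paper node.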

Community Structure for Different Networks

DBLP1 has 6 communities:

- Machine Learning: NIPS, ICML, AISTATS, UAI
- Theoretical Computer Science: STOC, FOCS, SODA, COLT, ITCS, RANDOM, ICALP, ISAAC
- Data Mining: KDD, ICDM, CIKM, SDM, WSDM, RecSys
- Computer Vision: CVPR, ICCV, ECCV, ICIP
- Artificial Intelligence: AAAI, IJCAI
- Natural Language Processing: ACL, NAACL, EMNLP, CONLL, COLING, EACL, SIGIR

DBLP2 has 3 communities:

- Networking and Communications: INFOCOM, GLOBECOM, ICC
- Systems: OSDI, SOSP, NSDI, SIGCOMM, MOBICOM, MOBISYS, CONEXT, ATC
- Information Theory: ISIT, ITA, SIGMETRICS, MOBIHOC

DBLP3 has 3 communities:

- Databases: VLDB, SIGMOD, PODS, CIKM, ICDE
- Data Mining: KDD, ICDM, SDM, SIGIR
- World Wide Web: WWW, WSDM, WINE, ICWSM

DBLP4 has 3 communities:

- Programming Languages: PLDI, POPL, OOPSLA, ICLP, ESOP, ICFP
- Software Engineering: FSE, ICSE, ASE/KBSE
- Formal Methods: CAV, FM, SAS, FMSD, IFM, ICFEM, FORTE, CADE, TABLEAUX, LPAR

DBLP5 has 4 communities:

- Computer Architecture: ASPLOS, ISCA, MICRO, HPCA
- Computer Hardware: FPGA, CHES, ICCD, ISLPED, ASAP, ISPD
- Real-time and Embedded Systems: RTSS, RTAS, ECRTS, MODELS, LCTRTS, CASES, EMSOFT, SCOPES
- Computer-Aided Design: DAC, ICCAD, DATE, ASPDAC

MAG1 has 3 communities:

- Computational Biology and Bioinformatics
- Organic Chemistry
- Genetics

MAG2 has 3 communities:

- Machine Learning
- Artificial Intelligence
- Mathematical Optimization

Data Format

For each network, there are two txt files:

Adjacency Matrix \(\mathbf{A}\in\mathbb{R}^{n\times n}\):

  • Edge list format: each row of the txt file is “Node_ID_1 Node_ID_2”, representing an edge between Node_ID_1 and Node_ID_2

Community Ground Truth \(\mathbf{\Theta}\in\mathbb{R}^{n\times K}\):

  • Sparse matrix format: each row of the txt file is “Row_\(i\) Column_\(j\) Value_\(\mathbf{\Theta}_{ij}\)”
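As a rough guide to reading the two files, the sketch below parses both formats using only the Python standard library. It rests on several assumptions not stated above: fields are whitespace-separated, node IDs and row/column indices are integers, and edges are undirected. Indices are kept exactly as they appear in the file, so no assumption is made about 0-based versus 1-based numbering. The function names and the inline sample rows are hypothetical.

```python
def read_edges(lines):
    """Parse 'Node_ID_1 Node_ID_2' rows into a symmetric adjacency dict."""
    adj = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed rows
        u, v = int(parts[0]), int(parts[1])
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)  # assumed undirected
    return adj

def read_theta(lines):
    """Parse 'i j value' triplets into a nested dict {i: {j: value}}."""
    theta = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        i, j, value = int(parts[0]), int(parts[1]), float(parts[2])
        theta.setdefault(i, {})[j] = value
    return theta

# Made-up sample rows in the two formats described above.
edge_rows = ["1 2", "2 3"]
theta_rows = ["1 1 1.0", "2 1 0.5", "2 2 0.5", "3 2 1.0"]

adj = read_edges(edge_rows)
theta = read_theta(theta_rows)
print(adj[2])    # neighbours of node 2
print(theta[2])  # non-zero community weights of node 2
```

Both functions accept any iterable of strings, so an open file handle can be passed directly, e.g. `with open(path) as f: adj = read_edges(f)`, where `path` points to one of the downloaded txt files.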

Download

The data can be downloaded from here. The bipartite version of DBLP networks can be downloaded from here.

Separate files:

Code

  • The MATLAB implementation of GeoNMF in our paper can be downloaded from here.

  • The MATLAB implementation of SPACL in our paper can be downloaded from here.

Citation

Xueyu Mao, Purnamrita Sarkar, and Deepayan Chakrabarti, “On Mixed Memberships and Symmetric Nonnegative Matrix Factorizations”, in Proceedings of the 34th International Conference on Machine Learning, PMLR 70:2324-2333, 2017. [BibTeX]

Xueyu Mao, Purnamrita Sarkar, and Deepayan Chakrabarti, “Estimating Mixed Memberships with Sharp Eigenvector Deviations”, Journal of the American Statistical Association, DOI: 10.1080/01621459.2020.1751645. [BibTeX]