An efficient exact graph sampling algorithm
About the code
The code is an implementation of the algorithm described in [1]. It provides an efficient way to perform sampling of the realizations of any given degree sequence. Previously existing graph sampling methods were either link-swap based (Markov-Chain Monte Carlo algorithms) or stub-matching based (the Configuration Model). Both types are ill-controlled, with typically unknown mixing times for link-swap methods and uncontrolled rejections for the Configuration Model. Conversely, this is an efficient, polynomial time algorithm that generates statistically independent graph samples with a given, arbitrary, degree sequence. Unlike other algorithms, this method always produces a sample, without backtracking or rejections.
To download the code, click here.
If you use this code for your research, I kindly ask you to cite Ref. 1 in your publications.
For questions, requests or notifications, please write to C dot I dot del-Genio at warwick dot ac dot uk.
Usage
The code consists of the files GSampler.c, GSampler.h and GSamp.h. To use it, just include GSamp.h in your code, and compile GSampler.c with your other source files, remembering to use the option -lm as it needs to link to the math library.
The code defines two new data structures, called sequence and graph. The members of the structure sequence, used internally by the sampling algorithm, are
- int *degree
- int *label
- int *forbidden
The members of the structure graph are
- int **list
- double weight
Before starting sampling a degree sequence, the sampler needs to be initialized by invoking the function gsaminit. The prototype of the function is
void gsaminit(const int *seq, const int n)
where seq is a pointer to the degree sequence, and n is the number of nodes in the sequence.
To create and store a sample realization of a given degree sequence, the user should invoke the function gsam. The prototype of the function is
graph gsam(double (*rng)(void), const int stsm)
The function returns a realization of the proper degree sequence for which it has been initialized, using the user-specified random number generator rng. The array of integers seq[] must be a proper degree sequence, that is, a non-increasing graphical sequence of integers. The random number generator must be a function taking no input parameters and returning a double precision floating point number between 0 and 1. The generator must be already seeded. This leaves the user the choice of the generator to use. The variable stsm is a flag governing the way target nodes are chosen for connection: if set to 0, the nodes are chosen randomly amongst those allowed; if set to anything but 0, the nodes are chosen with a probability proportional to their residual degree. Given a sequence and a choice of random number generator, the user creates a sample by declaring a variable G of type graph, and then assigning it the return value of gsam:
G = gsam(rng);
After the assignment, G.list is a densely allocated matroid containing the adjacency list, and G.weight is the logarithm of the weight associated with that particular sample. New samples of the sequence can be obtained invoking gsam again.
Please note that the adjacency list of a previous sample is destroyed with further calls to the sampler, even if the sample is assigned to a different variable. Thus, for instance, the lines
G = gsam(rng);
H = gsam(rng);
will result in the same adjacency list stored in G.list and in H.list.
After finishing sampling a given sequence, the memory used should be cleaned by invoking gsamclean(). Afterwards, the sampler is ready to be initialized again with another degree sequence.
A minimal proof of concept program is included, in the file poc.c. The code can be compiled on a standard GNU/Linux distribution with the command
gcc -std=c99 -lm -o poc GSampler.c poc.c
The program invokes the graph sampling algorithm to produce 10 realizations of the sequence {2, 2, 2, 1, 1}, using a simple random number generator. After the generation of each sample, the adjacency list and the logarithm of the weight are displayed on screen. Please note that (pseudo) random number generation for scientific or cryptographic applications is a complex subject, and the actual generator to use in publication-level sampling should be an established, tested, one. In the proof of concept, a simple one is used just for sake of simplicity. It should probably not be used otherwise, and definitely not be used for cryptographic applications, as there exist more appropriate and far better generators.
Suggestions
One of the return values provided by the code is the logarithm of the weight associated with each sample, to be used in an expression for the weighted mean of some observable. To avoid dealing with numbers of substantially different order of magnitude, a useful trick is to employ a formula for the logarithm of the sum. This way, one can find directly log(a+b) knowing log(a) and log(b). To see how this works, call x=log(a), y=log(b), and result=log(a+b). Then the following chain of identities holds:
log(a+b) |
= log(a*(1+b/a)) |
= log(a) + log(1+b/a) | |
= x + log(1+b/a) | |
= x + log(1+exp(y)/exp(x)) | |
= x + log(1+exp(y-x)) |
Now notice that, if y>x, then y-x>0, and therefore the exponential in the expression above can grow without control. However, if y≤x, then the argument of that same exponential is negative or 0. Then, in this case, that exponential will be a real number between 0 and 1. If it is so, then the second term in the sum is log(1+ε), with ε between 0 and 1. But this is a very easily computed quantity, as it can be comfortably and precisely expanded in series, so much that the C programming language even has a function for it (log1p). Then, knowing x and y, all one needs to do is to make sure that y≤x. Since the sum is a symmetric operation, all this can be easily written in C as
result = fmax(x,y) + log1p(exp(-fabs(x-y)));
fmax returns whichever is greater between x and y, fabs returns the absolute value of the difference between x and y, and the minus sign before it makes sure that the exponential is negative. Since the exponential is quite small, the whole formula is particularly stable.
Aside from this, there are still some caveats. The first is that, in the weighted mean formula for a series of observable measurements Q_{i} with weights w_{i}, it's probably better not to compute the sum of w_{i}Q_{i} and the sum of w_{i} independently and then subtract the logarithms. Instead, one can use a stable algorithm to directly compute the ensemble average of Q on the fly. A particularly well-suited algorithm is West's algorithm [2], which is very straightforward. An easy explanation of the algorithm can be found here, under the section "Weighted incremental algorithm". As a good side-effect, the algorithm will also provide the uncertainty associated with the ensemble average of Q. Notice that, as it's discussed in West's original paper, this algorithm should be used only when one cannot save all the data and analyse them later, in which case the best choice would be a two-pass algorithm.
A second point of caution is that when computing mean and standard deviation, one often ends up not just summing, but also subtracting. In fact, subtractions are carried out about 50% of the times. The above formula for the logarithmic sum can be adapted for subtraction too, becoming
log(a-b) = x + log(1-exp(y-x)),
or, in C,
result = x + log1p(-exp(y-x));
The formula is always valid in the general case of a>b, but it's not as stable as that of the sum. The reason is that log(1+ε) changes relatively slowly for ε>0, but it changes quite quickly for ε<0. However, this is not too big a concern in the case of a mean and standard deviation calculation. In fact, West's algorithm converges relatively quickly to the correct value. This means that the amplitude of the oscillations around the actual ensemble average will quickly decrease. Thus, potentially the only problematic situations can happen for the first few terms in the calculation of the mean (or better, for half of them), but typically this is not a problem.
Finally, the last thing to be aware of is that of course the logarithmic formulae above will work only if one is dealing with positive numbers. However, some observables could very well be negative. For instance, one might be measuring one of the assortativity coefficients of a graph. In this case, the range of possible values for the observable would be from -1 to 1. Anyway, problems such as this are easily solved if one knows the theoretical range of the measurements. Then, one can artificially sum a certain same number to all the measurements, and average over these "shifted" results. In the assortativity example, one could sum 2 to all the measurements, thus making sure that the averages would be over the range 1 to 3, and then subtract 2 from the final result. This would guarantee that one never tries to take the logarithm of a negative number, and, importantly, that one can always say, a priori, which is the greater of the two numbers involved at every step.
References
[1] Del Genio, Kim, Toroczkai and Bassler, PLoSOne 5 (4), e10012
[2] West, Commun. ACM 22, 532-535
Release information
Current version
4.13: bug fix: sampling weights for node sampling are now correct (stub sampling was not affected).
Old versions
4.12: introduced option to choose between stub or node selection; reverted rescaling introduced in version 4.9; fixed function prototyping.
4.11: bug fix: when using the sampler iteratively on different sequences the amount of memory allocated could be more than actually needed or used.
4.10: fixed compliance with C99 standard.
4.9: the code now rescales the distribution of weights to keep the numbers small.
4.8: the code now accepts sequences ending with a series of nodes with 0 degree. These nodes will simply be ignored, as they will not form any edge in the final sample. Also, the program now chooses nodes out of the allowed set with probability proportional to their residual degrees, rather than uniformly.
License and Copyright
This program is Copyright © 2009-2015 Charo I. Del Genio
Licensed under the Academic Free License version 3.0
1) Grant of Copyright License. Licensor grants You a worldwide, royalty-free, non-exclusive, sublicensable license, for the duration of the copyright, to do the following:
a) to reproduce the Original Work in copies, either alone or as part of a collective work;
b) to translate, adapt, alter, transform, modify, or arrange the Original Work, thereby creating derivative works ("Derivative Works") based upon the Original Work;
c) to distribute or communicate copies of the Original Work and Derivative Works to the public, under any license of your choice that does not contradict the terms and conditions, including Licensor's reserved rights and remedies, in this Academic Free License;
d) to perform the Original Work publicly; and
e) to display the Original Work publicly.
2) Grant of Patent License. Licensor grants You a worldwide, royalty-free, non-exclusive, sublicensable license, under patent claims owned or controlled by the Licensor that are embodied in the Original Work as furnished by the Licensor, for the duration of the patents, to make, use, sell, offer for sale, have made, and import the Original Work and Derivative Works.
3) Grant of Source Code License. The term "Source Code" means the preferred form of the Original Work for making modifications to it and all available documentation describing how to modify the Original Work. Licensor agrees to provide a machine-readable copy of the Source Code of the Original Work along with each copy of the Original Work that Licensor distributes. Licensor reserves the right to satisfy this obligation by placing a machine-readable copy of the Source Code in an information repository reasonably calculated to permit inexpensive and convenient access by You for as long as Licensor continues to distribute the Original Work.
4) Exclusions From License Grant. Neither the names of Licensor, nor the names of any contributors to the Original Work, nor any of their trademarks or service marks, may be used to endorse or promote products derived from this Original Work without express prior permission of the Licensor. Except as expressly stated herein, nothing in this License grants any license to Licensor's trademarks, copyrights, patents, trade secrets or any other intellectual property. No patent license is granted to make, use, sell, offer for sale, have made, or import embodiments of any patent claims other than the licensed claims defined in Section 2. No license is granted to the trademarks of Licensor even if such marks are included in the Original Work. Nothing in this License shall be interpreted to prohibit Licensor from licensing under terms different from this License any Original Work that Licensor otherwise would have a right to license.
5) External Deployment. The term "External Deployment" means the use, distribution, or communication of the Original Work or Derivative Works in any way such that the Original Work or Derivative Works may be used by anyone other than You, whether those works are distributed or communicated to those persons or made available as an application intended for use over a network. As an express condition for the grants of license hereunder, You must treat any External Deployment by You of the Original Work or a Derivative Work as a distribution under section 1(c).
6) Attribution Rights. You must retain, in the Source Code of any Derivative Works that You create, all copyright, patent, or trademark notices from the Source Code of the Original Work, as well as any notices of licensing and any descriptive text identified therein as an "Attribution Notice." You must cause the Source Code for any Derivative Works that You create to carry a prominent Attribution Notice reasonably calculated to inform recipients that You have modified the Original Work.
7) Warranty of Provenance and Disclaimer of Warranty. Licensor warrants that the copyright in and to the Original Work and the patent rights granted herein by Licensor are owned by the Licensor or are sublicensed to You under the terms of this License with the permission of the contributor(s) of those copyrights and patent rights. Except as expressly stated in the immediately preceding sentence, the Original Work is provided under this License on an "AS IS" BASIS and WITHOUT WARRANTY, either express or implied, including, without limitation, the warranties of non-infringement, merchantability or fitness for a particular purpose. THE ENTIRE RISK AS TO THE QUALITY OF THE ORIGINAL WORK IS WITH YOU. This DISCLAIMER OF WARRANTY constitutes an essential part of this License. No license to the Original Work is granted by this License except under this disclaimer.
8) Limitation of Liability. Under no circumstances and under no legal theory, whether in tort (including negligence), contract, or otherwise, shall the Licensor be liable to anyone for any indirect, special, incidental, or consequential damages of any character arising as a result of this License or the use of the Original Work including, without limitation, damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses. This limitation of liability shall not apply to the extent applicable law prohibits such limitation.
9) Acceptance and Termination. If, at any time, You expressly assented to this License, that assent indicates your clear and irrevocable acceptance of this License and all of its terms and conditions. If You distribute or communicate copies of the Original Work or a Derivative Work, You must make a reasonable effort under the circumstances to obtain the express assent of recipients to the terms of this License. This License conditions your rights to undertake the activities listed in Section 1, including your right to create Derivative Works based upon the Original Work, and doing so without honoring these terms and conditions is prohibited by copyright law and international treaty. Nothing in this License is intended to affect copyright exceptions and limitations (including "fair use" or "fair dealing"). This License shall terminate immediately and You may no longer exercise any of the rights granted to You by this License upon your failure to honor the conditions in Section 1(c).
10) Termination for Patent Action. This License shall terminate automatically and You may no longer exercise any of the rights granted to You by this License as of the date You commence an action, including a cross-claim or counterclaim, against Licensor or any licensee alleging that the Original Work infringes a patent. This termination provision shall not apply for an action alleging patent infringement by combinations of the Original Work with other software or hardware.
11) Jurisdiction, Venue and Governing Law. Any action or suit relating to this License may be brought only in the courts of a jurisdiction wherein the Licensor resides or in which Licensor conducts its primary business, and under the laws of that jurisdiction excluding its conflict-of-law provisions. The application of the United Nations Convention on Contracts for the International Sale of Goods is expressly excluded. Any use of the Original Work outside the scope of this License or after its termination shall be subject to the requirements and penalties of copyright or patent law in the appropriate jurisdiction. This section shall survive the termination of this License.
12) Attorneys' Fees. In any action to enforce the terms of this License or seeking damages relating thereto, the prevailing party shall be entitled to recover its costs and expenses, including, without limitation, reasonable attorneys' fees and costs incurred in connection with such action, including any appeal of such action. This section shall survive the termination of this License.
13) Miscellaneous. If any provision of this License is held to be unenforceable, such provision shall be reformed only to the extent necessary to make it enforceable.
14) Definition of "You" in This License. "You" throughout this License, whether in upper or lower case, means an individual or a legal entity exercising rights under, and complying with all of the terms of, this License. For legal entities, "You" includes any entity that controls, is controlled by, or is under common control with you. For purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
15) Right to Use. You may use the Original Work in all ways not otherwise restricted or conditioned by this License or by law, and Licensor promises not to interfere with or be responsible for such uses by You.
16) Modification of This License. This License is Copyright © 2005 Lawrence Rosen. Permission is granted to copy, distribute, or communicate this License without modification. Nothing in this License permits You to modify this License as applied to the Original Work or to Derivative Works. However, You may modify the text of this License and copy, distribute or communicate your modified version (the "Modified License") and apply it to other original works of authorship subject to the following conditions: (i) You may not indicate in any way that your Modified License is the "Academic Free License" or "AFL" and you may not use those names in the name of your Modified License; (ii) You must replace the notice specified in the first paragraph above with the notice "Licensed under <insert your license name here>" or with a notice of your own that is not confusingly similar to the notice in this License; and (iii) You may not claim that your original works are open source software unless your Modified License has been approved by Open Source Initiative (OSI) and You comply with its license review and certification process.