Some Measurements on Direct Connect File Lists
(Published on 2014-01-09)
I've been working on Direct Connect related projects for a while now. This includes maintaining ncdc and Globster, and doing a bit of research into improving the downloading performance and scalability (to be published at some later date). Whether I'm writing code or trying to setup experiments for research, there's one thing that helps a lot in making decisions. Measurements from an actual network.
Because useful measurements are often missing, I decided to do some myself. There's a lot to measure in an actual P2P network, but I restricted myself to information that can be gathered quite easily from file lists.
Obtaining the Data
Different hubs will likely have totally different patterns in terms of what is being shared. In order to keep this experiment simple, I limited myself to a single hub. And in order to get as much data as possible, I chose the hub that is commonly known as "Ozerki", famous for being one of the larger hubs in existence.
My approach to getting as many file lists as possible from this hub was perhaps a bit too simple. I simply modified ncdc to have an "add the file list from all users to the download queue" key, and to save all downloaded lists to a directory instead of opening them.
I started this downloading process on a Monday around noon when there were a little over 11k users online. I hit my hacked download-all-filelists-key two more times later that day in order to get the file lists from those users who joined the hub at a later time. I let this downloading process running until the evening.
One thing I learned from this experience was that the downloading algorithm in ncdc (1.18.1) does not scale particularly well. Every 60 seconds, it would try to open a connection with all users listed in the download queue. You can imagine that trying to connect to 11k users simultaneously put a significantly heavier load on the hub than would have been necessary. Not good. Not something a well-behaving netizen would do. Surprisingly enough, the hub didn't seem to mind too much and handled the load fine. This might have been because Mondays are typically not the most busy days in P2P land. Weekends tend to be busier.
Despite that scalability issue, I successfully managed to download the file lists of almost everyone who remained online for long enough to finally get their list downloaded. In total I managed to download 14143 file lists (that's one list too many for
10000*sqrt(2), I should have stopped the process a bit earlier). The total bzip2-compressed size of these lists is 6.5 GiB.
For obvious reasons, I won't be sharing my modifications to ncdc. I already tarnished the reputation of ncdc enough in that single day. If you wish to repeat this experiment, please do so with a scalable downloading implementation. :-)
Obtaining the Stats
And then comes the challenge of aggregating statistics on 6.5 GiB of compressed XML files. This didn't really sound like much of a challenge. After all, all one needs to do is decompress the file lists, do some XML parsing and update some values. Most of the CPU time in this process would likely be spent on bzip2 decompression, so I figured I'd just pipe the output of bzcat(1) to a Perl script and be done with it.
To get the statistics on the sizes and the distribution of unique files, a data structure containing information on all unique files in the lists was necessary. Perl being the perfect language for data manipulation, I made use of its great support for hash tables to store this information. It turned out, rather unsurprisingly, that Perl isn't all that conservative with respect to memory usage. Neither my 4GB or RAM nor the extra 4GB of swap turned out to be enough to run the script to completion. I tried rewriting the script to use a disk-based data structure, but that slowed things down to a crawl. Some other solution was needed.
When faced with such a problem, some people will try to optimize the algorithm, others will throw extra hardware at it, and I did what I do best: Optimize away the constants. That is, I rewrote the data analysis program in C. Using the excellent khash hash table library to keep track of the file information and the equally awesome yxml library (a little bit of self-promotion doesn't hurt, right?) to do the XML parsing, I was able to do all the necessary processing in 30 minutes using at most 3.6GB of RAM.
Long story short, here's my analysis program: dcfilestats.c.
A Look at the Stats
Some lists didn't decompress/parse correctly, so the actual number of file lists used in these stats is 14137. The total compressed size of these lists is 6,945,269,469 bytes (6.5 GiB), and uncompressed 25,533,519,352 bytes (24 GiB). In total these lists mentioned 197,413,253 files. After taking duplicate listings in account, there's still 84,131,932 unique files.
And now for some graphs...
Size of the File Lists
Behold, the compressed and uncompressed size of the downloaded file lists:
Nothing too surprising here, I guess. 100 KiB seems to be a common size for a compressed file lists, but lists of 1 MiB aren't too weird, either. The largest file list in this set is 34.8 MiB compressed and 120 MiB uncompressed. The uncompressed size of a list tends to be (*gasp*) a bit larger, but we can't easily infer the compression ratio from this graph. Hence, another graph:
Most file lists compress to about 24% - 35% of their original size. This seems to be consistent with similar measurements done in 2010.
The raw data for these graphs is found in dclistsize, which lists the compressed and uncompressed size, respectively, for each file list. The gnuplot script for the first graph is dclistsize.plot and dclistcomp.plot for the second.
Number of Files Per List
So how many files are people sharing? Let's find out.
As expected, this graph looks very similar to the one about the size of the file list. The size of a list tends to be linear in the number of items it holds, after all.
And how large are the files being shared? Well,
This graph is fun, and rather hard to explain without knowing what kind of files we're dealing with. I'm not going to do any further analysis on what kind of files these file sizes represent exactly, but I am going to make some guesses. The files below 1 MiB could be anything, text files, images, subtitles, source code, etc. And considering that the hub in question doesn't put a whole lot of effort in weeding out spammers and bots, it's likely that some malicious users will be sharing small variations of the same virus within the 100 KiB range. The peak of files between 7 and 10 MiB would likely be audio files. The number of files larger than, say, 20 MiB drop significantly, but there are still a few million files in the 20 MiB to 1 GiB range.
I cut off the graph after 10 GiB, but there's apparently someone who claims to share a file between 1 and 2 TiB (don't know the exact size due to the binning). Since I can't imagine why someone would share a file that large, I expect it to be a fake file list entry. Note that there could be more fakes in my data set. I can't tell which files are fake and which are genuine from the information in the file lists, but I don't expect the number of fake files to be very significant.
The "raw" data for this graph is found in dcfilesize. Because I wasn't interested in dealing with a text file of 84 million lines, the data is already binned. The first column is the bin number and the second column the number of unique files in that bin. The file sizes that each bin represents are between
2^(bin+10), with the exception of bin 0, which starts at a file size of 0. The source of the gnuplot script is dcfilesize.plot.
Distribution of Files
Another interesting thing to measure is how often files are shared. That is, how many users have the same file?
Many files are only available from a single user. That's not really a good sign when you wish to download such a file, but luckily there are also tons of files that are available from multiple users. What is interesting in this graph isn't that it follows the power law, but it's wondering what those outliers could possibly be. There's a collection of 269 files that has been shared among 831 users, and there appears to be a similar group of around 510-515 files that is shared among 20 or so users. I've honestly no idea what those collections could be. Well, yes, I could probably figure that out from the file lists, but my analysis program doesn't tell me which files it's talking about and I'm too lazy to fix that.
The graph has been clipped to 600, but there's another interesting outlier. A single file that has been shared by 5668 users. I'm going to guess that this is the empty file. There are so many ways to get an empty file somewhere in your filesystem, after all.
So, erm, what conclusions can we draw from this? That stats are fun, I guess. If anyone (including me) is going to repeat this experiment on a fresh data set, make sure to use a more scalable downloading process than I did. My approach shouldn't be repeated if we wish to keep the Direct Connect network alive.
Furthermore, keep in mind that this is just a snapshot of a single day on a single hub. The graphs may look very different when the file lists are harvested at some other time. And it's also quite likely that different hubs will have very different share profiles. It could be interesting to try and graph everything, but I don't have that kind of free time.