Overview
Two datasets of weblog entries are available for download here. The datasets contain the following metadata for weblog entries, most of which are written in Japanese.
- Unique IDs of entries.
- URIs of entries.
- Posted dates of entries.
- Trackbacks between entries.
The datasets were gathered by crawling the WWW. The crawl started from randomly selected entries, then followed the trackbacks pointing to each entry back to their source entries, and so on, until every entry connected by trackbacks had been gathered.
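The crawling procedure described above can be read as a breadth-first traversal over incoming trackbacks. The following is only a minimal sketch under that reading, not the crawler actually used: the function name crawl, the callable trackback_sources (which stands in for fetching an entry and extracting the entries that sent trackbacks to it), and the toy data are all hypothetical.

```python
from collections import deque


def crawl(seeds, trackback_sources):
    """Collect entries reachable from the seeds by following trackbacks backwards.

    trackback_sources is a hypothetical callable: given an entry URI, it
    returns the URIs of the entries that sent a trackback to that entry.
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque(seeds)
    edges = []  # (source, destination) pairs, i.e. directed trackbacks
    while queue:
        dest = queue.popleft()
        for src in trackback_sources(dest):
            edges.append((src, dest))
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen, edges


# Toy stand-in for the web (made-up data): destination -> trackback sources.
toy = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["a"]}
entries, edges = crawl(["a"], lambda uri: toy.get(uri, []))
```

In this toy run, entry "d" is never reached from seed "a", because the traversal only moves from a destination to the sources of its trackbacks.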
Samples
Each dataset consists of two tab-separated text files (*.tsv). One (entry.tsv) contains the unique ID, URI and posted date of each entry, one per column. A sample is shown below.
146390 http://sugamo.jugem.cc/?eid=268 2004-10-14 00:00:00
146540 http://numazu.jugem.jp/?eid=333 2004-10-13 00:00:00
146541 http://redkyudan.jugem.jp/?eid=372 2004-10-13 00:00:00
146538 http://brownshoes.jugem.cc/?eid=171 2004-10-13 00:00:00
:
:
Posted dates are obtained by parsing the body text of the entries, so they might not represent the true timestamps. Entries whose posted date could not be parsed properly are not included in the dataset.
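As an illustration of the entry.tsv layout, the rows could be read as follows. This is a sketch under some assumptions: the file has no header row, every date follows the YYYY-MM-DD HH:MM:SS pattern seen in the sample, and the function name read_entries is hypothetical.

```python
import csv
import io
from datetime import datetime


def read_entries(lines):
    """Yield (entry_id, uri, posted) tuples from tab-separated lines."""
    # QUOTE_NONE keeps quote characters inside URIs from being interpreted.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        entry_id, uri, posted = row
        # Assumed date format, taken from the sample rows.
        yield int(entry_id), uri, datetime.strptime(posted, "%Y-%m-%d %H:%M:%S")


# Two rows from the sample, with explicit tab separators.
sample = io.StringIO(
    "146390\thttp://sugamo.jugem.cc/?eid=268\t2004-10-14 00:00:00\n"
    "146540\thttp://numazu.jugem.jp/?eid=333\t2004-10-13 00:00:00\n"
)
entries = list(read_entries(sample))
```

For the real file, an open file object can be passed instead of the in-memory sample, for example open("entry.tsv", newline="", encoding="utf-8"). The encoding is an assumption and may need to be adjusted.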
The other (trackback.tsv) contains pairs of entry IDs, each of which represents a trackback. The first column of each row is the entry ID of the source of a trackback, and the second column is the entry ID of its destination. Please note that trackbacks are directed. A sample is shown below.
146963 115934
129257 129266
118694 118664
146906 146910
:
:
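Because trackbacks are directed, the pairs in trackback.tsv map naturally onto a directed graph. The sketch below builds an adjacency mapping and counts incoming trackbacks per entry. The function name read_trackbacks is hypothetical, and the third sample row is made up for illustration only; it does not come from the datasets.

```python
from collections import defaultdict


def read_trackbacks(lines):
    """Build source -> destinations links and per-entry incoming counts."""
    out_links = defaultdict(set)
    in_degree = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed rows
        src, dst = map(int, fields)
        if dst not in out_links[src]:  # count each distinct pair once
            out_links[src].add(dst)
            in_degree[dst] += 1
    return out_links, in_degree


sample = [
    "118694\t118664\n",
    "146906\t146910\n",
    "146906\t118664\n",  # made-up row, for illustration only
]
out_links, in_degree = read_trackbacks(sample)
```

Keeping the direction (source first, destination second) matters for any later analysis, since an entry that receives many trackbacks is not the same as one that sends many.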
The seed URIs of the crawling process are listed in a separate text file (seeds.txt).
Related Papers
- Makoto Uchida and Naoki Shibata. Extracting and Visualization of an Emerging Topic from the Blogspace. In Proceedings of the 20th Annual Conference of the Japanese Society for Artificial Intelligence, (2006) PDF (In Japanese).
- Makoto Uchida and Susumu Shirayama. Formation of patterns from complex networks. In Proceedings of the 12th International Symposium on Flow Visualization, (2006) PDF.
- Makoto Uchida and Naoki Shibata. Identification and Visualization of Emerging Trends from Blogosphere. Submitted to the International Conference on Weblogs and Social Media, (2007) PDF.
Terms of Use
You can use these datasets freely, as long as they are used for ACADEMIC purposes. If you use them, we ask that you cite this website or any of the related papers listed above.
In addition, we would appreciate it if you sent us a copy of any papers you produce using these datasets. You may email us directly at the contact address below. Thank you for your cooperation. :-)
Download
Dataset #1
25668 entries, 135656 trackbacks traced from 1 seed.
compressed files (blogdata1.tgz, 1.3 megabytes)
Dataset #2
201824 entries, 1253078 trackbacks traced from 7183 seeds.
compressed files (blogdata2.tgz, 12 megabytes)
Visualizations
Visualized by LGL (Large Graph Layout). Edges are colored according to a community division.
Community Expression of Dataset #1
Click for large images.
Normal Size (2400x1800 PNG, 3.38 megabytes),
Large Size (8000x6000 PNG, 2.75 megabytes).
3-Dimensional Expression of Dataset #1
Click for a large image (2400x1800 PNG, 905 kilobytes).
Evolving Process of Dataset #1
Click image for movie (MPEG1 movie, 5.00 megabytes).
Visual Analysis on Dynamics of Blogosphere Network
Submitted to Competition on Visualizing Network Dynamics at NetSci 2007, based on Dataset #1.
Click for large images.
Normal Size (1949x1571 JPEG, 688 kilobytes),
Full Res Image (8119x6544 TIFF, 152 megabytes).