Here we provide two datasets of weblog entries for download. The datasets contain the following metadata of weblog entries, most of which are written in Japanese.
- Unique IDs of entries.
- URIs of entries.
- Posted dates of entries.
- Trackbacks between entries.
The datasets were gathered by crawling the WWW. The crawl started from randomly selected entries, traced the trackbacks pointing to each entry back to their source entries, and repeated this until every entry connected by trackbacks had been gathered.
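The tracing procedure described above can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions, not the crawler that was actually used: the function name `trace_trackbacks` and the in-memory mapping `incoming_trackbacks` are hypothetical, and the mapping stands in for fetching a page and reading the trackbacks it received. The sketch follows trackbacks only from a destination to its sources, which is the direction the description mentions.

```python
from collections import deque


def trace_trackbacks(seed, incoming_trackbacks):
    """Gather every entry reached from `seed` by following trackbacks to their sources.

    `incoming_trackbacks` maps an entry to the entries that sent a trackback
    to it (a hypothetical stand-in for fetching and parsing the entry's page).
    """
    gathered = {seed}
    queue = deque([seed])
    while queue:
        entry = queue.popleft()
        for source in incoming_trackbacks.get(entry, ()):
            if source not in gathered:  # visit each entry only once
                gathered.add(source)
                queue.append(source)
    return gathered


# Toy data: "b" and "c" sent trackbacks to "a", "d" sent one to "b", and so on.
toy = {"a": ["b", "c"], "b": ["d"], "d": ["a"]}
gathered = trace_trackbacks("a", toy)
```

The visited set keeps the traversal from looping when trackbacks form a cycle, as they do between "a", "b" and "d" in the toy data.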
Each dataset consists of two tab-separated text files (*.tsv). One (entry.tsv) contains the unique ID, URI, and posted date of each entry, one field per column. A sample is presented below.
146390 http://sugamo.jugem.cc/?eid=268 2004-10-14 00:00:00
146540 http://numazu.jugem.jp/?eid=333 2004-10-13 00:00:00
146541 http://redkyudan.jugem.jp/?eid=372 2004-10-13 00:00:00
146538 http://brownshoes.jugem.cc/?eid=171 2004-10-13 00:00:00
Posted dates are obtained by parsing the body text of the entries, so they might not represent the true timestamps. Entries whose posted date could not be parsed properly are not included in the dataset.
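As a minimal sketch of how entry.tsv might be read, the following parses rows into tuples. It assumes that the three columns are separated by tabs and that the posted date always has the form shown in the sample (YYYY-MM-DD HH:MM:SS); the function name `parse_entries` is hypothetical.

```python
import csv
from datetime import datetime


def parse_entries(lines):
    """Parse entry.tsv rows into (entry_id, uri, posted_date) tuples."""
    entries = []
    # QUOTE_NONE keeps any quote characters inside URIs untouched.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        entry_id, uri, posted = row[0], row[1], row[2]
        # Assumed date format, taken from the sample rows.
        posted_date = datetime.strptime(posted, "%Y-%m-%d %H:%M:%S")
        entries.append((int(entry_id), uri, posted_date))
    return entries


sample = [
    "146390\thttp://sugamo.jugem.cc/?eid=268\t2004-10-14 00:00:00",
    "146540\thttp://numazu.jugem.jp/?eid=333\t2004-10-13 00:00:00",
]
entries = parse_entries(sample)
```

An open file object can be passed in place of the sample list, since the function accepts any iterable of lines. The character encoding of the file is not stated here, so it would have to be checked before opening it.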
The other (trackback.tsv) contains pairs of entry IDs, each of which represents a trackback. The first column of each row is the entry ID of the source of a trackback, and the second column is the entry ID of its destination. Please note that trackbacks are directed. Here is a sample.
146963 115934
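Because trackbacks are directed, it can help to keep both directions when loading trackback.tsv. The sketch below builds two adjacency maps from rows of source and destination IDs; the function name `build_trackback_graph` is hypothetical, and rows that do not contain exactly two fields are assumed to be safe to skip.

```python
from collections import defaultdict


def build_trackback_graph(lines):
    """Build directed adjacency maps from 'source<TAB>destination' rows."""
    outgoing = defaultdict(set)  # source ID -> IDs it sent trackbacks to
    incoming = defaultdict(set)  # destination ID -> IDs it received trackbacks from
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed rows
        source, destination = int(fields[0]), int(fields[1])
        outgoing[source].add(destination)
        incoming[destination].add(source)
    return outgoing, incoming


outgoing, incoming = build_trackback_graph(["146963\t115934"])
```

With these maps, `len(incoming[entry_id])` gives the number of distinct entries that sent a trackback to a given entry, and `len(outgoing[entry_id])` gives the number it sent trackbacks to.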
The seed URIs of the crawling process are listed in another text file (seeds.txt).
- Makoto Uchida and Naoki Shibata. Extracting and Visualization of an Emerging Topic from the Blogspace. In Proceedings of the 20th Annual Conference of the Japanese Society for Artificial Intelligence, (2006) PDF (In Japanese).
- Makoto Uchida and Susumu Shirayama. Formation of patterns from complex networks. In Proceedings of the 12th International Symposium on Flow Visualization, (2006) PDF.
- Makoto Uchida and Naoki Shibata. Identification and Visualization of Emerging Trends from Blogosphere. Submitted to the International Conference on Weblogs and Social Media, (2007) PDF.
You can use these datasets freely, as long as they are used for ACADEMIC purposes. If you use them, we ask that you refer to this website or to any of the related papers listed above.
In addition, we would appreciate it if you sent us a copy of any papers you produce using these datasets. You may email us directly at the contact address below. We thank you for your cooperation. :-)
25668 entries, 135656 trackbacks traced from 1 seed.
compressed files (blogdata1.tgz, 1.3 megabytes)
201824 entries, 1253078 trackbacks traced from 7183 seeds.
compressed files (blogdata2.tgz, 12 megabytes)
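After downloading, the contents of an archive can be inspected before extracting anything. The sketch below uses Python's standard tarfile module and assumes the .tgz files are gzip-compressed tar archives; the function name `list_archive_members` is hypothetical, and the directory layout inside the archives is not described here.

```python
import tarfile


def list_archive_members(archive_path):
    """Return the names of the regular files in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


# Example usage, assuming the archive has been saved to the working directory:
# print(list_archive_members("blogdata1.tgz"))
```

Listing the members first shows where entry.tsv, trackback.tsv and seeds.txt sit inside the archive.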
Visualized by LGL (Large Graph Layout). Edges are colored according to a community division.
Community Expression of Dataset #1
3-Dimensional Expression of Dataset #1
Evolving Process of Dataset #1
Visual Analysis on Dynamics of Blogosphere Network
Submitted to Competition on Visualizing Network Dynamics at NetSci 2007, based on Dataset #1.