Available Data

Information about the data available to the LINC users

Introduction

Over the past few years, members of the LINC group have collected and generated data for their experiments. This page summarizes the available data.

Note About Path Information

Directory paths on this page are given as seen from the CACS or LINC Linux machines.

To access the data from Windows, you need to use the Samba mappings. Please reference this page for more information.
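Before starting an experiment from a Linux machine, it can be useful to confirm that the data directories are reachable. The following is a minimal sketch, not an official LINC tool: the helper name `find_missing` and the selection of paths are assumptions made for illustration, while the paths themselves are copied from this page.

```python
import os

# Paths copied from this page; extend the list as needed.
DATA_DIRS = [
    "/nimbus/data/FeatureBase/Beer",
    "/nimbus/data/Imagery/ImageArchives",
    "/nimbus/data/Imagery/Highway",
    "/nimbus/data/Imagery/Landsat",
    "/nimbus/textData/PubMedOpenAccess",
]


def find_missing(paths):
    """Return the paths that are not existing directories (hypothetical helper)."""
    return [p for p in paths if not os.path.isdir(p)]


if __name__ == "__main__":
    # Report every listed directory that cannot be seen from this machine.
    for path in find_missing(DATA_DIRS):
        print("not reachable:", path)
```

An empty result means every listed directory is visible from the current machine. It does not check read permissions on the files inside.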

Feature Based

XML - Beer

We have a small beer database, acquired from the BeerSmith Brewing Company in 2007, that has been used in clustering experiments, primarily to see if the beer styles and categories could be recovered. This can be found at /nimbus/data/FeatureBase/Beer.

Imagery

Images, Various

A number of images have been acquired over the past years to test various image processing concepts.  At present, the images include:

  • 811 Images of Buildings
  • 16 Images of Interior Walls/Floors
  • 241 Aerial Images (sequence)
  • 525 Pairs of Images with Lidar Data
  • 1769 Pairs of Images without Lidar Data

These images are stored within the CACS network at /nimbus/data/Imagery/ImageArchives.

Images, Highway

As part of an ongoing project, we have access to over 400 gigabytes of high-resolution images of Louisiana highways, taken from the driver's point of view. These images are available at /nimbus/data/Imagery/Highway.

Landsat

In 2005, a few Landsat images were acquired as part of an image annotation experiment. The raw imagery, the processed TIFF images, and the annotations can all be found, within the CACS network, at /nimbus/data/Imagery/Landsat.

Images, Object Class Recognition

In Summer 2009, we downloaded four data sets from the Object Class Recognition web page of Microsoft Research in Cambridge. These four sets are:

  • Database of thousands of weakly labelled, high-res images.
  • Pixel-wise labelled image database v1.
  • Pixel-wise labelled image database v2.
  • Pixel-wise labelled image database of textile materials.

These image collections, along with the license information, can be found at /nimbus/data/Imagery/ImagesFromMicrosoft.

Image Annotation Sets

Two additional image sets, iaprtc12 and saiaprtc12, are available for those investigating image annotation. The datasets can be found at:

  • iaprtc12: /nimbus/data/Imagery/iaprtc12
  • saiaprtc12: /nimbus/data/Imagery/saiaprtc12

Text

Reuters Corpus 1

This corpus is composed of about 810,000 news articles, written in English, collected from August 20, 1996 to August 19, 1997. Each article has been tagged with respect to topic, industry association, and location. This was acquired from NIST, at this site. NOTE: You must talk to Dr. Raghavan to use this dataset.

Reuters Corpus 1 - version 2

A preprocessed version of the Reuters Corpus 1, created by David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. This was acquired from the website of David Lewis, at this location.

Reuters Corpus 2

This corpus is composed of about 487,000 news articles in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish). They were collected from August 20, 1996 to August 19, 1997. Each article has been tagged with respect to topic, industry association, and location. Note that the stories are not parallel. This was acquired from NIST at this site. NOTE: You must talk to Dr. Raghavan to use this dataset.

TREC

Around 2002/2003, three TREC collections were acquired.  These are:

  • Web TREC
  • Trec-5
  • Trec-4

More information about the collections can be found here (access restricted). They were acquired from NIST.

Webspam-UK2006, Summary Version

A collection of 4,560,800 webpages, collected by Yahoo, that have been labeled as spam, borderline, or normal. Other information, such as linkages and scores, is also included. We acquired this from Yahoo Research Lab, at this website. NOTE: You must talk to Dr. Raghavan to use this dataset.

XML - PubMed Central

We have also acquired a set of biomedical and life sciences articles, stored in XML format, from the PubMed Central Open Access Subset collection.  The local copy can be found at /nimbus/textData/PubMedOpenAccess.
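A first look at an XML collection like this one can be taken with the Python standard library. The following is a minimal sketch under stated assumptions: the helper name `iter_xml_roots` is hypothetical, and the file extensions `.xml` and `.nxml` are guesses about how the local copy is stored, so adjust them to match the actual files.

```python
import os
import xml.etree.ElementTree as ET


def iter_xml_roots(top, extensions=(".xml", ".nxml")):
    """Yield (path, root_tag) for each parseable XML file under `top`.

    The extensions are an assumption about the local copy, not a documented fact.
    """
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in sorted(filenames):
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            try:
                root = ET.parse(path).getroot()
            except ET.ParseError:
                continue  # skip files that fail to parse rather than abort the scan
            yield path, root.tag


if __name__ == "__main__":
    # Path taken from this page; prints each file with its root element name.
    for path, tag in iter_xml_roots("/nimbus/textData/PubMedOpenAccess"):
        print(tag, path)
```

Files that the parser rejects are silently skipped in this sketch, so a count of yielded files may be lower than the number of files on disk.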

Mixed Types

Record Extraction/Next Page Link

We have 19 data sets, created by Dheerendranath Mundluru, which pertain to work in record extraction (from web pages) and 'next page link' detection.  The data sets can be found at /nimbus/data/Mixed/Dheeru-WebExtraction.

ADNI Data

We have acquired MRI, PET, and clinical data from the Alzheimer's Disease Neuroimaging Initiative. Subsets of the PET data have been processed to create feature-based data. This data is available to those working with Drs. Raghavan, Chu, and Benton who are interested in applied data mining or in the development of new algorithms.
