


  1. The SWELL (Spectral Word Embedding Learning for Language) Java toolkit for inducing word embeddings is available here.

  2. The ANTsR toolkit for medical image analysis (including the implementation of our NeuroImage 2014 paper) is available here.

  3. Various eigenword (SWELL) embeddings for English, as used in our JMLR results, can be found here. *No additional scaling is required; use the embeddings as-is.*

    Our recommendation: in our experiments, the OSCCA and TSCCA embeddings are the most robust and work best across a variety of tasks, followed closely by LR-MVL(I). (A sketch of how to load these embedding files follows the table below.)

    | Eigenword | Details |
    | --- | --- |
    | OSCCA (h=2) | Trained on Reuters RCV1 (no lowercasing or cleaning); v=100k, k=200, context size h=2 |
    | TSCCA (h=2) | Trained on Reuters RCV1 (no lowercasing or cleaning); v=100k, k=200, context size h=2 |
    | LR-MVL(I) (h=2) | Trained on Reuters RCV1 (no lowercasing or cleaning); v=100k, k=200, context size h=2 |
    | LR-MVL(II) | Trained on Reuters RCV1 (no lowercasing or cleaning); v=100k, k=200, smooths=0.5 |
    | OSCCA (h=10) | Trained on Reuters RCV1 (no lowercasing or cleaning); v=100k, k=200, context size h=10 |
    | TSCCA (h=10) | Trained on Reuters RCV1 (no lowercasing or cleaning); v=100k, k=200, context size h=10 |
    | LR-MVL(I) (h=10) | Trained on Reuters RCV1 (no lowercasing or cleaning); v=100k, k=200, context size h=10 |
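    For reference, here is a minimal sketch of loading these embeddings in Python. It assumes each line of the (uncompressed) embedding file is a word followed by its k=200 real-valued coordinates, all whitespace-separated; this file layout is an assumption, not a documented specification, so check the README shipped with the downloads.

    ```python
    import numpy as np

    def load_eigenwords(path):
        """Load eigenword embeddings from a plain-text file.

        Assumes one word per line, followed by its k real-valued
        coordinates, whitespace-separated (an assumed layout, not
        a documented specification of these files).
        """
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.split()
                if len(parts) < 2:
                    continue  # skip blank or malformed lines
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float64)
        return vectors

    # Hypothetical file name, for illustration only:
    # embeddings = load_eigenwords("oscca_h2_rcv1.txt")
    ```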
  4. Generic eigenword embeddings for various languages, trained on much larger corpora. (A nearest-neighbor usage sketch follows the table.)

    | Language | Details |
    | --- | --- |
    | English | Trained on English Gigaword (no lowercasing or cleaning); v=300k, k=200 |
    | German | Trained on German Newswire (no lowercasing or cleaning); v=300k, k=200, context size h=2 |
    | French | Trained on French Gigaword (no lowercasing or cleaning); v=300k, k=200 |
    | Spanish | Trained on Spanish Gigaword (no lowercasing or cleaning); v=300k, k=200 |
    | Italian | Trained on Italian Newswire+Wiki (no lowercasing or cleaning); v=300k, k=200 |
    | Dutch | Trained on Dutch Newswire+Wiki (no lowercasing or cleaning); v=300k, k=200 |
    | Chinese (Simplified, characters) | Trained on Chinese Gigaword; v=11k, k=200 |
    | Chinese (Simplified, Stanford Tokenizer) | Trained on Chinese Gigaword; v=300k, k=200 |
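    Once loaded, the embeddings can be queried directly, for example with cosine-similarity nearest neighbors. The sketch below assumes a vectors dict as produced by the loading sketch above; the file name in the usage comment is hypothetical.

    ```python
    import numpy as np

    def nearest_neighbors(vectors, query, n=5):
        """Return the n words whose embeddings have the highest
        cosine similarity to the query word's embedding."""
        q = vectors[query]
        q = q / np.linalg.norm(q)
        scored = []
        for word, v in vectors.items():
            if word == query:
                continue
            scored.append((float(q @ v) / float(np.linalg.norm(v)), word))
        scored.sort(reverse=True)
        return [(word, score) for score, word in scored[:n]]

    # Hypothetical usage (see the loading sketch above):
    # vectors = load_eigenwords("english_gigaword_eigenwords.txt")
    # print(nearest_neighbors(vectors, "bank"))
    ```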