Glenn Glenn - 1 month ago 21
C Question

Convert word2vec bin file to text

From the word2vec site I can download GoogleNews-vectors-negative300.bin.gz. The .bin file (about 3.4GB) is a binary format not useful to me. Tomas Mikolov assures us that "It should be fairly straightforward to convert the binary format to text format (though that will take more disk space). Check the code in the distance tool, it's rather trivial to read the binary file." Unfortunately, I don't know enough C to understand

Supposedly gensim can do this also, but all the tutorials I've found seem to be about converting from text, not the other way.

Can someone suggest modifications to the C code or instructions for gensim to emit text?


On the word2vec-toolkit mailing list Thomas Mensink has provided an answer in the form of a small C program that will convert a .bin file to text. This is a modification of the distance.c file. I replaced the original distance.c with Thomas's code below and rebuilt word2vec (make clean; make), and renamed the compiled distance to readbin. Then ./readbin vector.bin will create a text version of vector.bin.

//  Copyright 2013 Google Inc. All Rights Reserved.
//  Licensed under the Apache License, Version 2.0 (the "License");
//  you may not use this file except in compliance with the License.
//  You may obtain a copy of the License at
//  Unless required by applicable law or agreed to in writing, software
//  distributed under the License is distributed on an "AS IS" BASIS,
//  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
//  See the License for the specific language governing permissions and
//  limitations under the License.

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <malloc.h>

const long long max_size = 2000;         // max length of strings
const long long N = 40;                  // number of closest words that will be shown
const long long max_w = 50;              // max length of vocabulary entries

int main(int argc, char **argv) {
  FILE *f;
  char file_name[max_size];
  float len;
  long long words, size, a, b;
  char ch;
  float *M;
  char *vocab;
  if (argc < 2) {
    printf("Usage: ./distance <FILE>\nwhere FILE contains word projections in the BINARY FORMAT\n");
    return 0;
  strcpy(file_name, argv[1]);
  f = fopen(file_name, "rb");
  if (f == NULL) {
    printf("Input file not found\n");
    return -1;
  fscanf(f, "%lld", &words);
  fscanf(f, "%lld", &size);
  vocab = (char *)malloc((long long)words * max_w * sizeof(char));
  M = (float *)malloc((long long)words * (long long)size * sizeof(float));
  if (M == NULL) {
    printf("Cannot allocate memory: %lld MB    %lld  %lld\n", (long long)words * size * sizeof(float) / 1048576, words, size);
    return -1;
  for (b = 0; b < words; b++) {
    fscanf(f, "%s%c", &vocab[b * max_w], &ch);
    for (a = 0; a < size; a++) fread(&M[a + b * size], sizeof(float), 1, f);
    len = 0;
    for (a = 0; a < size; a++) len += M[a + b * size] * M[a + b * size];
    len = sqrt(len);
    for (a = 0; a < size; a++) M[a + b * size] /= len;
  //Code added by Thomas Mensink
  //output the vectors of the binary format in text
  printf("%lld %lld #File: %s\n",words,size,file_name);
  for (a = 0; a < words; a++){
    printf("%s ",&vocab[a * max_w]);
    for (b = 0; b< size; b++){ printf("%f ",M[a*size + b]); }

  return 0;

I removed the "\b\b" from the printf.

By the way, the resulting text file still contained the text word and some unnecessary whitespace which I did not want for some numerical calculations. I removed the initial text column and the trailing blank from each line with bash commands.

cut --complement -d ' ' -f 1 GoogleNews-vectors-negative300.txt > GoogleNews-vectors-negative300_tuples-only.txt
sed 's/ $//' GoogleNews-vectors-negative300_tuples-only.txt