Alexander Whatley - 1 month ago
Python Question

Pandas fails to read past 216th line of jagged text file

I have a jagged text file (a different number of columns in each row) and am trying to read it with pandas. For some reason, pandas can read the first 216 lines, but fails as soon as I ask for the first 217.

>>> df = pd.read_table("test.txt", names = range(2000), nrows = 216)
>>> df = pd.read_table("test.txt", names = range(2000), nrows = 217)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/alexwhatley/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/alexwhatley/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 321, in _read
    return parser.read(nrows)
  File "/Users/alexwhatley/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 815, in read
    ret = self._engine.read(nrows)
  File "/Users/alexwhatley/anaconda3/lib/python3.5/site-packages/pandas/io/parsers.py", line 1314, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
  File "pandas/parser.pyx", line 839, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9208)
  File "pandas/parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas/parser.c:9731)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)
  File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325)
pandas.io.common.CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.


The file is here: https://gist.github.com/alexanderwhatley/e07af297b1a10cd5cb57c7b75ee7f229. Does anyone know what is going on?

Answer

A workaround is to split each line on tabs yourself and build the DataFrame from the resulting lists:

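import pandas as pd

# Split every line on tabs ourselves instead of relying on the pandas parser.
the_file = []
# Text mode ('r'), so that split('\t') also works on Python 3.
with open(r"./genes.txt", 'r') as f:
    for line in f:
        # Drop the trailing newline so it does not end up in the last field.
        the_file.append(line.rstrip('\n').split('\t'))

# The widest row sets the number of columns; shorter rows are padded.
df = pd.DataFrame(the_file, columns=range(max(len(l) for l in the_file)))

print(df[0])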

Result:

0                       KEGG_GLYCOLYSIS_GLUCONEOGENESIS
1                          KEGG_CITRATE_CYCLE_TCA_CYCLE
2                        KEGG_PENTOSE_PHOSPHATE_PATHWAY
3         KEGG_PENTOSE_AND_GLUCURONATE_INTERCONVERSIONS
4                  KEGG_FRUCTOSE_AND_MANNOSE_METABOLISM
5                             KEGG_GALACTOSE_METABOLISM
6                KEGG_ASCORBATE_AND_ALDARATE_METABOLISM
7                            KEGG_FATTY_ACID_METABOLISM
8                             KEGG_STEROID_BIOSYNTHESIS
9                   KEGG_PRIMARY_BILE_ACID_BIOSYNTHESIS
10                    KEGG_STEROID_HORMONE_BIOSYNTHESIS
11                       KEGG_OXIDATIVE_PHOSPHORYLATION
12                               KEGG_PURINE_METABOLISM
13                           KEGG_PYRIMIDINE_METABOLISM
14      KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM
15         KEGG_GLYCINE_SERINE_AND_THREONINE_METABOLISM
16              KEGG_CYSTEINE_AND_METHIONINE_METABOLISM
17       KEGG_VALINE_LEUCINE_AND_ISOLEUCINE_DEGRADATION
18      KEGG_VALINE_LEUCINE_AND_ISOLEUCINE_BIOSYNTHESIS
19                              KEGG_LYSINE_DEGRADATION
20                 KEGG_ARGININE_AND_PROLINE_METABOLISM
21                            KEGG_HISTIDINE_METABOLISM
22                             KEGG_TYROSINE_METABOLISM
23                        KEGG_PHENYLALANINE_METABOLISM
24                           KEGG_TRYPTOPHAN_METABOLISM
25                         KEGG_BETA_ALANINE_METABOLISM
26              KEGG_TAURINE_AND_HYPOTAURINE_METABOLISM
27                     KEGG_SELENOAMINO_ACID_METABOLISM
28                          KEGG_GLUTATHIONE_METABOLISM
29                   KEGG_STARCH_AND_SUCROSE_METABOLISM
                             ...                       
425                                      ST_GAQ_PATHWAY
426                                     ST_GA13_PATHWAY
427                                    ST_STAT3_PATHWAY
428                                    SA_FAS_SIGNALING
429                                  SA_G1_AND_S_PHASES
430    SIG_INSULIN_RECEPTOR_PATHWAY_IN_CARDIAC_MYOCYTES
431                       ST_T_CELL_SIGNAL_TRANSDUCTION
432                        ST_TYPE_I_INTERFERON_PATHWAY
433                            ST_PAC1_RECEPTOR_PATHWAY
434                 SIG_PIP3_SIGNALING_IN_B_LYMPHOCYTES
435                           SIG_BCR_SIGNALING_PATHWAY
436                                  SA_G2_AND_M_PHASES
437                          ST_B_CELL_ANTIGEN_RECEPTOR
438                            ST_INTERLEUKIN_4_PATHWAY
439                         ST_WNT_BETA_CATENIN_PATHWAY
440                          SA_MMP_CYTOKINE_CONNECTION
441                                 ST_JNK_MAPK_PATHWAY
442                            SA_PROGRAMMED_CELL_DEATH
443                            ST_FAS_SIGNALING_PATHWAY
444                               ST_MYOCYTE_AD_PATHWAY
445                                     SA_PTEN_PATHWAY
446                       SA_REG_CASCADE_OF_CYCLIN_EXPR
447                                    SA_TRKA_RECEPTOR
448                ST_PHOSPHOINOSITIDE_3_KINASE_PATHWAY
449                                 PID_FANCONI_PATHWAY
450                          PID_SMAD2_3NUCLEAR_PATHWAY
451                                   PID_FCER1_PATHWAY
452                              PID_ENDOTHELIN_PATHWAY
453                                    PID_BCR_5PATHWAY
454                    PID_PRL_SIGNALING_EVENTS_PATHWAY
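
The same approach can be tried without the original file. The following is only a minimal sketch under stated assumptions: `read_jagged` is a hypothetical helper name (it is not part of pandas), and the three sample rows are made up for illustration, with only the first-column names borrowed from the output above. It splits each line on tabs, sizes the frame to the widest row, and then counts how many fields each row really had.

```python
import io

import pandas as pd


def read_jagged(lines, sep="\t"):
    """Build a DataFrame from rows that have different numbers of fields.

    Hypothetical helper, not a pandas function. `lines` is any iterable
    of strings, for example an open text file.
    """
    # Skip blank lines and strip line endings before splitting.
    rows = [line.rstrip("\r\n").split(sep) for line in lines if line.strip()]
    # The widest row decides how many columns the frame gets.
    width = max(len(row) for row in rows)
    # Rows shorter than `width` are padded with missing values by pandas.
    return pd.DataFrame(rows, columns=range(width))


# Made-up sample standing in for the real file: rows with 3, 2 and 4 fields.
sample = io.StringIO(
    "KEGG_GLYCOLYSIS_GLUCONEOGENESIS\tA\tB\n"
    "KEGG_CITRATE_CYCLE_TCA_CYCLE\tC\n"
    "KEGG_PENTOSE_PHOSPHATE_PATHWAY\tD\tE\tF\n"
)
df = read_jagged(sample)

print(df.shape)
# Number of fields actually present in each row (padding is not counted).
print(df.notna().sum(axis=1).tolist())
```

Passing an open file instead of the `io.StringIO` object, for example `read_jagged(open("./genes.txt"))`, should behave the same way. The per-row field counts are a quick way to see how jagged the input is and which rows are the widest.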