Error loading tables into OmniSci in knn_model.py #13
Are your table name and dataframe name both "m"? |
Yes. Is that an issue?
|
Please share your script with me by email. |
conn.load_table("mrg",m,create='infer',method='arrow'). Is this the line causing error? Could you try to print "m" by using and share the output: print(m.head()) |
Hi Devika,
Yes, it appears that is the line causing the error. Here is the printed output:
>> m.head(5)
dpost rpost neighbor_id
0 0.0 0.0 AK-630667
1 0.0 0.0 AK-701587
2 0.0 0.0 AK-656813
3 0.0 0.0 AK-656812
4 0.0 0.0 AK-701520
|
Try this instead of "conn.load_table("voters",df,create='infer',method='arrow')":
conn.execute("Create table IF NOT EXISTS mrg (dpost FLOAT, rpost FLOAT, neighbor_id TEXT ENCODING NONE);")
conn.load_table_columnar("mrg", m, preserve_index=False) |
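For context, a self-contained sketch of that two-step load with pymapd; the connection parameters and the sample frame below are placeholders, not values from this issue:

import pandas as pd
from pymapd import connect

# placeholder credentials; substitute your own server details
conn = connect(user='admin', password='HyperInteractive',
               host='localhost', dbname='omnisci')
m = pd.DataFrame({'dpost': [0.0], 'rpost': [0.0], 'neighbor_id': ['AK-630667']})

# Create the table explicitly instead of relying on create='infer',
# then push the dataframe column by column.
conn.execute("CREATE TABLE IF NOT EXISTS mrg (dpost FLOAT, rpost FLOAT, neighbor_id TEXT ENCODING NONE);")
conn.load_table_columnar("mrg", m, preserve_index=False)

Creating the table up front avoids the type inference that the Arrow loader performs, which may be what failed here. |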
This seems to work. The current problem arises from reading in the knn output file (in this case it is knn_1000_CA1_2012.tar.gz). It seems to be timing out or hitting a memory limit?
>> df = pd.read_csv(filename, sep=',',dtype='unicode',index_col=None, low_memory='true',compression='gzip')
Killed
|
What GPU memory are you using? Please send the parameters of your job. |
Also, please test the entire script with a smaller file so that we know whether the problem is in the script or in memory. |
In testing this on the Rhode Island file, I am able to load the knn_1000_RI1_2012.tar.gz file, but it does not look like the data frame we expect:
>> df.head(5)
knn_1000_RI_2012.csv
0 PA-000007920358\tPA-10407918\td\tr\t40.2433976...
1 PA-000007920358\tPA-10408513\td\tr\t40.2433976...
2 PA-000007920358\tPA-000006487459\td\td\t40.243...
3 PA-000007920358\tPA-000006909098\td\td\t40.243...
4 PA-000007920358\tPA-000000307624\td\tr\t40.243...
>>
I think this means the sep is “\t” not “,”?
|
Yes, the separator is '\t', but in your script you specified ',', no? |
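For reference, a sketch of the read call adjusted for the tab separator; header=None is added on the assumption that the file has no header row, and low_memory takes a boolean rather than the string 'true':

df = pd.read_csv(filename, sep='\t', dtype='unicode', index_col=None,
                 low_memory=True, header=None, compression='gzip')
|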
So I can upload the file, but when I try to load it into OmniSci the process dies:
>> conn.execute("Create table IF NOT EXISTS knn (source_id TEXT ENCODING NONE, neighbor_id TEXT ENCODING NONE, source_pid TEXT ENCODING NONE, neighbor_pid TEXT ENCODING NONE, dist FLOAT);")
<pymapd.cursor.Cursor object at 0x2b96a3dd4048>
>> conn.load_table_columnar("knn", df,preserve_index=False)
Killed
|
This seems like a memory issue. Please send me the parameters you used to launch the job.
|
Please use a 256 GB RAM, 2 CPU, 1 GPU machine.
|
That is what I'm using, I believe.
|
Please recheck the parameters, and if it still fails with 256 GB, first check with FASRC help (by email) whether memory is the reason for the failure. If memory is the reason, then you will have to divide the file into smaller chunks to model it, because FASRC does not allow more than 256 GB on GPU. While dividing into smaller chunks, make sure you include all neighbors of a voter in the file. For example, if you take voter IDs 1 to 100, then the file should have all 1000 neighbors for voter IDs 1-100, or else the modelling will be corrupt.
|
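A rough sketch of that splitting step: stream the KNN output and cut only at voter boundaries, so each voter keeps all of its neighbors in one output file. This assumes the file is sorted by source_id (as the sample output suggests); the file names, column names, and chunk size are placeholders.

import pandas as pd

cols = ['source_id', 'neighbor_id', 'source_pid', 'neighbor_pid', 'dist']  # adjust to the real layout
reader = pd.read_csv('knn_1000_CA1_2012.csv', sep='\t', header=None,
                     names=cols, dtype='unicode', chunksize=1_000_000)

carry = None   # rows of the voter whose neighbors may span two chunks
part = 0
for chunk in reader:
    if carry is not None:
        chunk = pd.concat([carry, chunk], ignore_index=True)
    last_voter = chunk['source_id'].iloc[-1]
    carry = chunk[chunk['source_id'] == last_voter]      # possibly incomplete voter
    complete = chunk[chunk['source_id'] != last_voter]   # whole voters only
    complete.to_csv('knn_part_%d.csv' % part, sep='\t', header=False, index=False)
    part += 1
if carry is not None and not carry.empty:
    carry.to_csv('knn_part_%d.csv' % part, sep='\t', header=False, index=False)
|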
Also, I would suggest testing with the smallest input file (smaller than RI) you have on hand, so that we are sure the script is correct before we solve the memory scaling issue.
|
Okay. How do I go about dividing it? By creating smaller groups from the outset, when generating the knn output?
|
Yes, that is what you would have to do ultimately for bigger files. Please divide it into smaller groups and try again, but before that check with FASRC whether memory is indeed the issue even with 256 GB.
|
Okay, I am in the process of re-running it. I should also note that the overall FASRC session does not die, just the python3 session started by the knn_model.py script.
|
If the overall session does not die, then it might not be a GPU memory issue. Please run the python script in screen.
|
Yes, I am running it in screen.
|
Then your dataframe is running out of memory reading the whole file at once, since it's too big. Please read it in chunks; look into the chunksize option of the Pandas read_csv and modify the script:
pd.read_csv(filename, chunksize=chunksize)
|
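A sketch of that chunked pattern end to end, assuming the "knn" table was already created as above and that conn is an open pymapd connection; the file name, column names, and chunk size are placeholders:

import pandas as pd

cols = ['source_id', 'neighbor_id', 'source_pid', 'neighbor_pid', 'dist']  # adjust to the real layout
reader = pd.read_csv('knn_1000_CA1_2012.gz', sep='\t', dtype='unicode', header=None,
                     names=cols, compression='gzip', chunksize=5_000_000)
for chunk in reader:
    # each chunk is an ordinary DataFrame, so it loads the same way the full frame did
    conn.load_table_columnar("knn", chunk, preserve_index=False)
|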
Okay, thanks Devika. This might solve one issue, but also recall that last night the process died while reading one of the smaller tables (RI) into OmniSci, that is, after it had successfully loaded into the Python environment.
|
I think it is a memory issue. Please divide the file into smaller sizes and try again, and let's see what happens.
|
Hi Devika,
After looking at this more, one of the issues might have to do with how the file is being read into Python. When I read the tarred file directly into Python, there is a weird value at the intersection of the first row and first column. This does not occur if I first unzip the file and then load the .csv into Python. Why might this be happening? See below:
>> df = pd.read_csv('knn_1000_AK1_2012.tar.gz', sep='\t',dtype='unicode',index_col=None, low_memory='true',compression='gzip', header=None)
>> df.head()
0 1 2 3 4 5 6
0 n/holyscratch01/enos_lab/jbrown613/data/knn_10... AK-709502 i d 0 \N \N
1 AK-787334 AK-706032 i r 0 \N \N
2 AK-787334 AK-647339 i r 0 \N \N
3 AK-787334 AK-618324 i i 0 \N \N
4 AK-787334 DC-567085 i i 0 \N \N
Compared to this when reading in the unzipped file:
>> df = pd.read_csv('knn_1000_AK1_2012.csv', sep='\t',dtype='unicode',index_col=None, low_memory='true',header=None)
>> df.head()
0 1 2 3 4 5 6
0 AK-787334 AK-709502 i d 0 \N \N
1 AK-787334 AK-706032 i r 0 \N \N
2 AK-787334 AK-647339 i r 0 \N \N
3 AK-787334 AK-618324 i i 0 \N \N
4 AK-787334 DC-567085 i i 0 \N \N
|
You are reading a .tar.gz compressed file, but in your dataframe read_csv you are specifying .gz (gzip) compression. This is causing the problem. Could you look into how to read a .tar.gz compressed file into a dataframe?
|
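One way to read the tar member directly, a sketch using the standard-library tarfile module and assuming the archive holds a single CSV:

import tarfile
import pandas as pd

with tarfile.open('knn_1000_AK1_2012.tar.gz', 'r:gz') as tar:
    member = tar.getmembers()[0]           # assumes one CSV inside the archive
    with tar.extractfile(member) as f:
        df = pd.read_csv(f, sep='\t', dtype='unicode', header=None)

Note that a .tar.gz is a gzipped tar archive, not a gzipped CSV: pandas' gzip support decompresses it, but the tar header (which embeds the member's file name) then leads the stream, which is why the file name shows up in the first cell. Re-compressing the plain CSV with gzip alone would avoid this; merely renaming the .tar.gz to .gz would not. |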
Thanks, I'll look into this. Is one potential solution also zipping the file such that it only has the extension .gz?
|
Yes.
|
This actually did not appear to have solved the issue, as we still have the filename in the first row/column:
>> df = pd.read_csv('knn_1000_AK1_2012.gz', sep='\t', dtype='unicode', index_col=None, low_memory='true', compression='gzip', header=None)
>> df.head()
0 1 2 3 4 5 6
0 knn_1000_AK1_2012.csv AK-709502 i d 0 \N \N
1 AK-787334 AK-706032 i r 0 \N \N
2 AK-787334 AK-647339 i r 0 \N \N
3 AK-787334 AK-618324 i i 0 \N \N
4 AK-787334 DC-567085 i i 0 \N \N
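Renaming the archive does not change its contents: the file is still a tar archive inside the gzip stream, so pandas decompresses the raw tar data and the tar member header, which carries the member's path, gets parsed into the first cell. A sketch (filenames assumed from above) that sidesteps this by letting tarfile consume the header and handing the CSV member straight to read_csv:

import tarfile
import pandas as pd

# Open the archive, locate the single CSV member, and stream it into
# pandas; tarfile strips the member header so it never reaches the parser.
with tarfile.open('knn_1000_AK1_2012.tar.gz', 'r:gz') as tar:
    member = tar.getmembers()[0]  # assumed: one CSV per archive
    with tar.extractfile(member) as fh:
        df = pd.read_csv(fh, sep='\t', dtype='unicode',
                         index_col=None, header=None)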
… On Sep 15, 2020, at 1:02 PM, dkakkar ***@***.***> wrote:
You are reading a .tar.gz compressed file, but in your dataframe read_csv you are specifying .gz compression. This is causing the problem. Could you look into how to read .tar.gz compression into a dataframe?
|
Hi Devika,
You can disregard my last email; I am still troubleshooting some things and will give a full report in a few hours.
Thanks,
Jake
… On Sep 15, 2020, at 1:11 PM, dkakkar ***@***.***> wrote:
Yes.
|
Please share the file with me.
…On Tue, Sep 15, 2020, 1:33 PM Jacob Brown ***@***.***> wrote:
This actually did not appear to have solved the issue, as we still have the filename in the first row/column: …
|
Sure, take your time.
|
Hi Devika,
So I have figured out how to handle reading in the zipped files, and I have been able to read some of the smaller files into both Python and OmniSci. The issues I am running into now involve running the modeling code you provided, as I am getting errors related to grouping on string columns. You can see that output below:
>> conn.execute("Create table results as (SELECT source_id, AVG(dpost) as mean_d_post, AVG(rpost) as mean_r_post, SUM(dpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_d_post, SUM(rpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_r_post FROM knn GROUP BY source_id);")
Traceback (most recent call last):
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
at_most_n=-1,
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
return self.recv_sql_execute()
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Cannot group by string columns which are not dictionary encoded.')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
return c.execute(operation, parameters=parameters)
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Cannot group by string columns which are not dictionary encoded.
I also got an error that I could not join tables using TEXT type variables in OmniSci. This occurred when I was trying to merge in the new rpost and dpost values:
>> conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
at_most_n=-1,
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
return self.recv_sql_execute()
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Projection type TEXT not supported for outer joins yet')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
return c.execute(operation, parameters=parameters)
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Projection type TEXT not supported for outer joins yet
… On Sep 15, 2020, at 2:44 PM, dkakkar ***@***.***> wrote:
Sure, take your time.
|
What is the data type for source_id?
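One way to check from the pymapd side (a sketch using pymapd's get_table_details introspection call, assuming the knn table from your script):

# Sketch: list each column's name, type, and encoding for the knn table.
for col in conn.get_table_details('knn'):
    print(col)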
|
The data type for source_id is STR
… On Sep 15, 2020, at 3:53 PM, Devika Kakkar ***@***.***> wrote:
What is the data type for source_id?
|
Here is the code used to make the table:
conn.execute("Create table IF NOT EXISTS knn (source_id TEXT ENCODING NONE, neighbor_id TEXT ENCODING NONE, dist FLOAT, dpost FLOAT, rpost FLOAT);")
conn.load_table_columnar("knn", df,preserve_index=False)
… On Sep 15, 2020, at 3:53 PM, Devika Kakkar ***@***.***> wrote:
What is the data type for source_id?
On Tue, Sep 15, 2020 at 3:52 PM Jacob Brown ***@***.*** ***@***.***>> wrote:
Hi Devika,
So I have figured out how to handle reading in the zipped files, and I have been able to read in some of the smaller files to both Python and OmniSci. The issues I am running into now involve running the modeling code you provided, as am getting errors related to grouping on string columns. You can see that output below:
>>> conn.execute("Create table results as (SELECT source_id, AVG(dpost) as mean_d_post, AVG(rpost) as mean_r_post, SUM(dpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_d_post, SUM(rpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_r_post FROM knn GROUP BY source_id);")
Traceback (most recent call last):
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
at_most_n=-1,
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
return self.recv_sql_execute()
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Cannot group by string columns which are not dictionary encoded.')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
return c.execute(operation, parameters=parameters)
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Cannot group by string columns which are not dictionary encoded.
I also got an error that I could not join tables using TEXT type variables in OmniSci. This occurred when I was trying to merge in the new rpost and dpost values:
>>> conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
at_most_n=-1,
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
return self.recv_sql_execute()
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Projection type TEXT not supported for outer joins yet')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
return c.execute(operation, parameters=parameters)
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Projection type TEXT not supported for outer joins yet
|
Please use TEXT ENCODING DICT wherever you define it.
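For example, for the mrg table (the same pattern applies to any TEXT column you group or join on). This is a sketch only: the column names come from this thread, DICT(32) is just the default dictionary width, and the table is dropped first so the new encoding actually takes effect:

conn.execute("DROP TABLE IF EXISTS mrg;")
conn.execute("Create table mrg (dpost FLOAT, rpost FLOAT, neighbor_id TEXT ENCODING DICT(32));")
conn.load_table_columnar("mrg", m, preserve_index=False)

Dictionary-encoded strings are what OmniSci needs both for GROUP BY on string columns and for projecting string columns out of joins.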
|
Thanks Devika,
That seems to fix those issues. I think the remaining issues are the potential memory problem, which I can solve by outputting smaller files, and a problem when joining in SQL/OmniSci. I am running up against a unique constraint error that I do not understand. The rpost/dpost data frame that I am joining to the knn output will have multiple matches, since I am joining on neighbor_id and people sometimes share neighbors. There are no duplicates in the rpost/dpost data frame itself, as it contains one row for each registered voter (each potential neighbor, if you will). This kind of merge/join would not be a problem with the analogous functions in Python/R, but it seems to hit a join limitation in SQL. Can you clarify what is going on?
>> conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
at_most_n=-1,
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
return self.recv_sql_execute()
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Sqlite3 Error: UNIQUE constraint failed: mapd_columns.tableid, mapd_columns.name')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
return c.execute(operation, parameters=parameters)
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Sqlite3 Error: UNIQUE constraint failed: mapd_columns.tableid, mapd_columns.name
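One thing I notice: SELECT * would project neighbor_id from both tables, so the created table would have two columns with the same name — possibly that is what the catalog constraint (mapd_columns.tableid, mapd_columns.name) is complaining about, rather than duplicate rows. If so, something like this, with every output column projected explicitly under a unique name, might avoid it (column names per the lists below):

conn.execute("Create table temp as (SELECT knn.source_id, knn.neighbor_id, knn.dist, mrg.dpost, mrg.rpost FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")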
|
Please send me the column names for both tables.
|
knn: source_id, neighbor_id, dist
mrg: dpost, rpost, neighbor_id
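With those columns, and assuming the join goes through with explicit projections as sketched above, the aggregation step of the modeling code would then run against the joined table rather than knn directly — a sketch, where FROM temp is my assumption about where dpost/rpost live after the join:

conn.execute("Create table results as (SELECT source_id, AVG(dpost) as mean_d_post, AVG(rpost) as mean_r_post, SUM(dpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_d_post, SUM(rpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_r_post FROM temp GROUP BY source_id);")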
|
Try:
Create table temp as (SELECT a.source_id, a.neighbor_id, a.dist, b.dpost, b.rpost FROM knn a LEFT JOIN mrg b ON a.neighbor_id = b.neighbor_id);
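For context, a minimal sketch of how that aliased join fits into the pymapd workflow used in this thread. The connection parameters below are illustrative placeholders (substitute your own host and credentials); the table names knn and mrg and the conn.execute pattern are the ones from the messages above.

from pymapd import connect

# Hypothetical connection details -- replace with your own server and login.
conn = connect(user='admin', password='HyperInteractive', host='localhost', dbname='omnisci')

# Listing the columns explicitly and aliasing the tables keeps only one
# neighbor_id column in the merged table.
conn.execute(
    "Create table temp as ("
    "SELECT a.source_id, a.neighbor_id, a.dist, b.dpost, b.rpost "
    "FROM knn a LEFT JOIN mrg b ON a.neighbor_id = b.neighbor_id);"
)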
|
Did it work?
|
Seems to work right now, yes. Thank you!
|
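For reference, a minimal sketch of the chunked read suggested above, assuming the tab-separated, headerless layout of the knn output files discussed later in this thread; the filename and chunk size are illustrative placeholders.

import pandas as pd

# Placeholder filename; assumes the archive has already been extracted to a
# plain CSV. The knn output files in this thread are tab-separated with no
# header row.
filename = 'knn_1000_CA1_2012.csv'

# chunksize makes read_csv return an iterator of DataFrames, so only about
# one million rows are held in memory at a time instead of the whole file.
for chunk in pd.read_csv(filename, sep='\t', dtype='unicode',
                         header=None, chunksize=1_000_000):
    print(len(chunk))  # replace with per-chunk processing or loading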
Just FYI, you were trying to select all columns from both tables previously, and in the final merged table you cannot have two columns with the same name (neighbor_id), so it was throwing a unique constraint error.
On Wed, Sep 16, 2020 at 1:20 PM Jacob Brown <[email protected]> wrote:
… Seems to work right now yes. Thank you!

On Sep 16, 2020, at 1:17 PM, dkakkar wrote:
Did it work?

On Tue, Sep 15, 2020 at 8:08 PM Devika Kakkar wrote:
Try:

Create table temp as (SELECT a.source_id, a.neighbor_id, a.dist, b.dpost, b.rpost FROM knn a LEFT JOIN mrg b ON a.neighbor_id = b.neighbor_id);
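A sketch of how that query might be issued end to end through pymapd; the connection parameters below are placeholders modeled on the connection string printed later in this thread, not the actual job settings.

from pymapd import connect

# Placeholder credentials, host, and port; adjust to the real server.
conn = connect(user='admin', password='HyperInteractive',
               host='localhost', port=9893, dbname='omnisci')

# Selecting explicit, table-qualified columns gives the merged table a single
# neighbor_id column, which avoids the duplicate-column-name failure that the
# SELECT * version triggers.
conn.execute(
    "Create table temp as ("
    "SELECT a.source_id, a.neighbor_id, a.dist, b.dpost, b.rpost "
    "FROM knn a LEFT JOIN mrg b ON a.neighbor_id = b.neighbor_id);"
)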
On Tue, Sep 15, 2020 at 8:03 PM Jacob Brown wrote:
Knn:
source_id
neighbor_id
dist

mrg:
dpost
rpost
neighbor_id

On Sep 15, 2020, at 7:37 PM, dkakkar wrote:
Pls send me column names for both tables.

On Tue, Sep 15, 2020, 7:22 PM Jacob Brown wrote:
Thanks Devika,

That seems to fix those issues. I think the remaining issue is the potential memory issue, which I can solve by outputting smaller files, and an issue when joining in sql/Omnisci. I am running up against a unique constraint error that I do not understand. The rpost/dpost data frame that I am joining to the knn output will have multiple matches, since I am joining it to neighbor_id, and sometimes people share neighbors. There are no duplicates in the rpost/dpost data frame, as it contains one row for each registered voter (or each potential neighbor, if you will). This kind of merge/join would not be a problem using similar functions in python/R, but seems to run up against a join difficulty in sql. Can you clarify what is going on?

>>> conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Sqlite3 Error: UNIQUE constraint failed: mapd_columns.tableid, mapd_columns.name')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Sqlite3 Error: UNIQUE constraint failed: mapd_columns.tableid, mapd_columns.name

On Sep 15, 2020, at 3:57 PM, dkakkar wrote:
Please use TEXT ENCODING DICT wherever you define it.

On Tue, Sep 15, 2020 at 3:55 PM Jacob Brown wrote:
The data type for source_id is STR

On Sep 15, 2020, at 3:53 PM, Devika Kakkar wrote:
What is the data type for source_id?
On Tue, Sep 15, 2020 at 3:52 PM Jacob Brown wrote:
Hi Devika,

So I have figured out how to handle reading in the zipped files, and I have been able to read some of the smaller files into both Python and OmniSci. The issues I am running into now involve running the modeling code you provided, as I am getting errors related to grouping on string columns. You can see that output below:

>>> conn.execute("Create table results as (SELECT source_id, AVG(dpost) as mean_d_post, AVG(rpost) as mean_r_post, SUM(dpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_d_post, SUM(rpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_r_post FROM knn GROUP BY source_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Cannot group by string columns which are not dictionary encoded.')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Cannot group by string columns which are not dictionary encoded.

I also got an error that I could not join tables using TEXT type variables in OmniSci. This occurred when I was trying to merge in the new rpost and dpost values:

>>> conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Projection type TEXT not supported for outer joins yet')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Projection type TEXT not supported for outer joins yet
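These two errors are what the "TEXT ENCODING DICT" advice above addresses; a hedged sketch of what that table definition could look like, with column types inferred from the column lists earlier in the thread (adjust to the real schema):

# Sketch only: recreate knn with dictionary-encoded TEXT columns so that
# GROUP BY source_id is permitted.
conn.execute("DROP TABLE IF EXISTS knn;")
conn.execute(
    "CREATE TABLE knn ("
    "source_id TEXT ENCODING DICT, "
    "neighbor_id TEXT ENCODING DICT, "
    "dist FLOAT);"
)
# Dictionary-encoded strings may also sidestep the 'Projection type TEXT not
# supported for outer joins yet' error, since the join no longer projects a
# none-encoded TEXT column.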
On Sep 15, 2020, at 2:44 PM, dkakkar wrote:
Sure, take your time.

On Tue, Sep 15, 2020 at 1:51 PM Jacob Brown wrote:
Hi Devika,

You can disregard my last email, I am still troubleshooting some things. I'll give a full report in a few hours.

Thanks,

Jake

On Sep 15, 2020, at 1:11 PM, dkakkar wrote:
Yes.

On Tue, Sep 15, 2020 at 1:09 PM Jacob Brown wrote:
Thanks, I'll look into this. Is one potential solution also zipping the file such that it only has the extension .gz?

On Sep 15, 2020, at 1:02 PM, dkakkar wrote:
You are reading a .tar.gz compressed file, but in your dataframe read CSV you are specifying .gz compression. This is causing the problem. Could you look into how to read .tar.gz compression into a dataframe?
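A .tar.gz is a gzipped tar stream, not a gzipped CSV, which is also why a stray path value shows up in the first cell of the output quoted below: the tar member header leaks into the data when the file is gunzipped directly. One possible way to read the CSV member straight out of the archive, assuming a single file inside (an untested sketch):

import tarfile
import pandas as pd

# Open the archive as a tar stream rather than as a gzipped CSV.
with tarfile.open('knn_1000_AK1_2012.tar.gz', 'r:gz') as tar:
    member = tar.getmembers()[0]  # assumes exactly one CSV in the archive
    df = pd.read_csv(tar.extractfile(member), sep='\t',
                     dtype='unicode', index_col=None, header=None)

print(df.head())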
On Tue, Sep 15, 2020 at 12:57 PM Jacob Brown wrote:
Hi Devika,

After looking at this more, one of the issues might have to do with how it is being read into Python. When I read the tarred file directly into python, there is a weird value at the intersection of the first row and first column. This does not occur if I first unzip the file and then load the .csv into Python. Why might this be happening? See below:

>>> df = pd.read_csv('knn_1000_AK1_2012.tar.gz', sep='\t', dtype='unicode', index_col=None, low_memory='true', compression='gzip', header=None)
>>> df.head()
                                                   0          1  2  3  4   5   6
0  n/holyscratch01/enos_lab/jbrown613/data/knn_10...  AK-709502  i  d  0  \N  \N
1                                          AK-787334  AK-706032  i  r  0  \N  \N
2                                          AK-787334  AK-647339  i  r  0  \N  \N
3                                          AK-787334  AK-618324  i  i  0  \N  \N
4                                          AK-787334  DC-567085  i  i  0  \N  \N

Compared to this when reading in the unzipped file:

>>> df = pd.read_csv('knn_1000_AK1_2012.csv', sep='\t', dtype='unicode', index_col=None, low_memory='true', header=None)
>>> df.head()
           0          1  2  3  4   5   6
0  AK-787334  AK-709502  i  d  0  \N  \N
1  AK-787334  AK-706032  i  r  0  \N  \N
2  AK-787334  AK-647339  i  r  0  \N  \N
3  AK-787334  AK-618324  i  i  0  \N  \N
4  AK-787334  DC-567085  i  i  0  \N  \N

On Sep 15, 2020, at 11:14 AM, dkakkar wrote:
I think it is a memory issue. Please divide the file into smaller pieces, try again, and let's see what happens.

On Tue, Sep 15, 2020 at 11:11 AM Jacob Brown wrote:
Okay, thanks Devika. This might solve one issue, but also recall that last night the process died while reading one of the smaller tables (RI) into OmniSci, that is, after successfully loading it into the Python environment.

On Sep 15, 2020, at 11:09 AM, dkakkar wrote:
Then your dataframe is running out of memory to read the whole file at once since it's too big. Please read it in chunks, look into chunksize option while using Pandas dataframe to modify the script:

pd.read_csv(filename, chunksize=chunksize)
|
While running the modified knn_model.py script I got the following error. It appears to be related to converting the merged table to OmniSci, but I do not know what the error message "Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array" means or how to fix it:
(omnisci) [jbrown613@holygpu2c0705 neighbors]$ time python3 ~/sql/knn_model_merge.py
Connecting to Omnisci
Connected Connection(omnisci://admin:***@localhost:9893/omnisci?protocol=binary)
Traceback (most recent call last):
  File "/n/home09/jbrown613/sql/knn_model_merge.py", line 37, in <module>
    conn.load_table("m", m, create='infer', method='arrow')
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 687, in load_table
    return self.load_table_arrow(table_name, data)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 835, in load_table_arrow
    data, metadata, preserve_index=preserve_index
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/_pandas_loaders.py", line 248, in serialize_arrow_payload
    data = pa.RecordBatch.from_pandas(data, preserve_index=preserve_index)
  File "pyarrow/table.pxi", line 704, in pyarrow.lib.RecordBatch.from_pandas
  File "pyarrow/table.pxi", line 749, in pyarrow.lib.RecordBatch.from_arrays
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array
real 5m54.114s
user 5m11.406s
sys 0m25.939s
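pa.RecordBatch.from_pandas requires each column to convert to a single contiguous pyarrow Array, and very large string columns typically come back as a ChunkedArray instead, which is the likely source of this error. A hedged workaround sketch, bypassing the arrow loader in favor of an explicit table definition plus pymapd's columnar loader; the schema here is illustrative and must be matched to the real dataframe:

# Sketch of a possible workaround, not verified on this data: create the
# table up front and load the dataframe column by column instead of going
# through the arrow path.
conn.execute(
    "CREATE TABLE IF NOT EXISTS m "
    "(dpost FLOAT, rpost FLOAT, neighbor_id TEXT ENCODING DICT);"
)
conn.load_table_columnar("m", m, preserve_index=False)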